Re: [PATCH 0/3] Check gso_size of packets when forwarding
On Thu, Jan 18, 2018 at 5:28 PM, Daniel Axtenswrote: > Pravin Shelar writes: > >> On Thu, Jan 18, 2018 at 5:08 AM, Daniel Axtens wrote: >>> Pravin Shelar writes: >>> On Mon, Jan 15, 2018 at 6:09 PM, Daniel Axtens wrote: > When regular packets are forwarded, we validate their size against the > MTU of the destination device. However, when GSO packets are > forwarded, we do not validate their size against the MTU. We > implicitly assume that when they are segmented, the resultant packets > will be correctly sized. > > This is not always the case. > > We observed a case where a packet received on an ibmveth device had a > GSO size of around 10kB. This was forwarded by Open vSwitch to a bnx2x > device, where it caused a firmware assert. This is described in detail > at [0] and was the genesis of this series. Rather than fixing it in > the driver, this series fixes the forwarding path. > Are there any other possible forwarding path in networking stack? TC is one subsystem that could forward such a packet to the bnx2x device, how is that handled ? >>> >>> So far I have only looked at bridges, openvswitch and macvlan. In >>> general, if the code uses dev_forward_skb() it should automatically be >>> fine as that invokes is_skb_forwardable(), which we patch. >>> >> But there are other ways to forward packets, e.g tc-mirred or bpf >> redirect. We need to handle all these cases rather than fixing one at >> a time. As Jason suggested netif_needs_gso() looks like good function >> to validate if a device is capable of handling given GSO packet. > > I am not entirely sure this is a better solution. > > The biggest reason I am uncomfortable with this is that if > netif_needs_gso() returns true, the skb will be segmented. The segment > sizes will be based on gso_size. Since gso_size is greater than the MTU, > the resulting segments will themselves be over-MTU. Those over-MTU > segments will then be passed to the network card. I think we should not > be creating over-MTU segments; we should instead be dropping the packet > and logging. > Would this oversized segment cause the firmware assert? If this solves the assert issue then I do not see much value in adding checks in fast-path just for logging.
[PATCH V2 net-next 1/4] net: hns3: add support for get_regs
From: Fuyun LiangThis patch adds get_regs support for ethtool cmd. Signed-off-by: Fuyun Liang Signed-off-by: Peng Li --- drivers/net/ethernet/hisilicon/hns3/hnae3.h| 3 +- drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c | 23 +++ .../net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h | 4 + .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c| 176 + 4 files changed, 205 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/hisilicon/hns3/hnae3.h b/drivers/net/ethernet/hisilicon/hns3/hnae3.h index 634e932..d104ce5 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hnae3.h +++ b/drivers/net/ethernet/hisilicon/hns3/hnae3.h @@ -356,7 +356,8 @@ struct hnae3_ae_ops { u32 stringset, u8 *data); int (*get_sset_count)(struct hnae3_handle *handle, int stringset); - void (*get_regs)(struct hnae3_handle *handle, void *data); + void (*get_regs)(struct hnae3_handle *handle, u32 *version, +void *data); int (*get_regs_len)(struct hnae3_handle *handle); u32 (*get_rss_key_size)(struct hnae3_handle *handle); diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c b/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c index 358f780..1c8b293 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c @@ -1063,6 +1063,27 @@ static int hns3_set_coalesce(struct net_device *netdev, return 0; } +static int hns3_get_regs_len(struct net_device *netdev) +{ + struct hnae3_handle *h = hns3_get_handle(netdev); + + if (!h->ae_algo->ops->get_regs_len) + return -EOPNOTSUPP; + + return h->ae_algo->ops->get_regs_len(h); +} + +static void hns3_get_regs(struct net_device *netdev, + struct ethtool_regs *cmd, void *data) +{ + struct hnae3_handle *h = hns3_get_handle(netdev); + + if (!h->ae_algo->ops->get_regs) + return; + + h->ae_algo->ops->get_regs(h, >version, data); +} + static const struct ethtool_ops hns3vf_ethtool_ops = { .get_drvinfo = hns3_get_drvinfo, .get_ringparam = hns3_get_ringparam, @@ -1103,6 +1124,8 @@ static const struct ethtool_ops hns3_ethtool_ops = { .set_channels = hns3_set_channels, .get_coalesce = hns3_get_coalesce, .set_coalesce = hns3_set_coalesce, + .get_regs_len = hns3_get_regs_len, + .get_regs = hns3_get_regs, }; void hns3_ethtool_set_ops(struct net_device *netdev) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h index 3c3159b..2561e7a 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h @@ -102,6 +102,10 @@ enum hclge_opcode_type { HCLGE_OPC_STATS_64_BIT = 0x0030, HCLGE_OPC_STATS_32_BIT = 0x0031, HCLGE_OPC_STATS_MAC = 0x0032, + + HCLGE_OPC_QUERY_REG_NUM = 0x0040, + HCLGE_OPC_QUERY_32_BIT_REG = 0x0041, + HCLGE_OPC_QUERY_64_BIT_REG = 0x0042, /* Device management command */ /* MAC commond */ diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c index 27f0ab6..c3d2cca 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c @@ -5544,6 +5544,180 @@ static int hclge_set_channels(struct hnae3_handle *handle, u32 new_tqps_num) return ret; } +static int hclge_get_regs_num(struct hclge_dev *hdev, u32 *regs_num_32_bit, + u32 *regs_num_64_bit) +{ + struct hclge_desc desc; + u32 total_num; + int ret; + + hclge_cmd_setup_basic_desc(, HCLGE_OPC_QUERY_REG_NUM, true); + ret = hclge_cmd_send(>hw, , 1); + if (ret) { + dev_err(>pdev->dev, + "Query register number cmd failed, ret = %d.\n", ret); + return ret; + } + + *regs_num_32_bit = le32_to_cpu(desc.data[0]); + *regs_num_64_bit = le32_to_cpu(desc.data[1]); + + total_num = *regs_num_32_bit + *regs_num_64_bit; + if (!total_num) + return -EINVAL; + + return 0; +} + +static int hclge_get_32_bit_regs(struct hclge_dev *hdev, u32 regs_num, +void *data) +{ +#define HCLGE_32_BIT_REG_RTN_DATANUM 8 + + struct hclge_desc *desc; + u32 *reg_val = data; + __le32 *desc_data; + int cmd_num; + int i, k, n; + int ret; + + if (regs_num == 0) + return 0; + + cmd_num = DIV_ROUND_UP(regs_num + 2, HCLGE_32_BIT_REG_RTN_DATANUM); + desc = kcalloc(cmd_num, sizeof(struct hclge_desc), GFP_KERNEL); + if (!desc)
[PATCH V2 net-next 2/4] net: hns3: add manager table initialization for hardware
From: Fuyun LiangThe manager table is empty by default. If it is not initialized, the management pkgs like LLDP will be dropped by hardware. Default entries need to be added to manager table. Signed-off-by: Fuyun Liang Signed-off-by: Peng Li --- .../net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h | 22 + .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c| 101 + 2 files changed, 123 insertions(+) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h index 2561e7a..1cd28e0 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h @@ -605,6 +605,28 @@ struct hclge_mac_vlan_mask_entry_cmd { u8 rsv2[14]; }; +#define HCLGE_MAC_MGR_MASK_VLAN_B BIT(0) +#define HCLGE_MAC_MGR_MASK_MAC_B BIT(1) +#define HCLGE_MAC_MGR_MASK_ETHERTYPE_B BIT(2) +#define HCLGE_MAC_ETHERTYPE_LLDP 0x88cc + +struct hclge_mac_mgr_tbl_entry_cmd { + u8 flags; + u8 resp_code; + __le16 vlan_tag; + __le32 mac_addr_hi32; + __le16 mac_addr_lo16; + __le16 rsv1; + __le16 ethter_type; + __le16 egress_port; + __le16 egress_queue; + u8 sw_port_id_aware; + u8 rsv2; + u8 i_port_bitmap; + u8 i_port_direction; + u8 rsv3[2]; +}; + #define HCLGE_CFG_MTA_MAC_SEL_S0x0 #define HCLGE_CFG_MTA_MAC_SEL_MGENMASK(1, 0) #define HCLGE_CFG_MTA_MAC_EN_B 0x7 diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c index c3d2cca..6e64bed 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c @@ -392,6 +392,16 @@ static const struct hclge_comm_stats_str g_mac_stats_string[] = { HCLGE_MAC_STATS_FIELD_OFF(mac_rx_send_app_bad_pkt_num)} }; +static const struct hclge_mac_mgr_tbl_entry_cmd hclge_mgr_table[] = { + { + .flags = HCLGE_MAC_MGR_MASK_VLAN_B, + .ethter_type = cpu_to_le16(HCLGE_MAC_ETHERTYPE_LLDP), + .mac_addr_hi32 = cpu_to_le32(htonl(0x0180C200)), + .mac_addr_lo16 = cpu_to_le16(htons(0x000E)), + .i_port_bitmap = 0x1, + }, +}; + static int hclge_64_bit_update_stats(struct hclge_dev *hdev) { #define HCLGE_64_BIT_CMD_NUM 5 @@ -4249,6 +4259,91 @@ int hclge_rm_mc_addr_common(struct hclge_vport *vport, return status; } +static int hclge_get_mac_ethertype_cmd_status(struct hclge_dev *hdev, + u16 cmdq_resp, u8 resp_code) +{ +#define HCLGE_ETHERTYPE_SUCCESS_ADD0 +#define HCLGE_ETHERTYPE_ALREADY_ADD1 +#define HCLGE_ETHERTYPE_MGR_TBL_OVERFLOW 2 +#define HCLGE_ETHERTYPE_KEY_CONFLICT 3 + + int return_status; + + if (cmdq_resp) { + dev_err(>pdev->dev, + "cmdq execute failed for get_mac_ethertype_cmd_status, status=%d.\n", + cmdq_resp); + return -EIO; + } + + switch (resp_code) { + case HCLGE_ETHERTYPE_SUCCESS_ADD: + case HCLGE_ETHERTYPE_ALREADY_ADD: + return_status = 0; + break; + case HCLGE_ETHERTYPE_MGR_TBL_OVERFLOW: + dev_err(>pdev->dev, + "add mac ethertype failed for manager table overflow.\n"); + return_status = -EIO; + break; + case HCLGE_ETHERTYPE_KEY_CONFLICT: + dev_err(>pdev->dev, + "add mac ethertype failed for key conflict.\n"); + return_status = -EIO; + break; + default: + dev_err(>pdev->dev, + "add mac ethertype failed for undefined, code=%d.\n", + resp_code); + return_status = -EIO; + } + + return return_status; +} + +static int hclge_add_mgr_tbl(struct hclge_dev *hdev, +const struct hclge_mac_mgr_tbl_entry_cmd *req) +{ + struct hclge_desc desc; + u8 resp_code; + u16 retval; + int ret; + + hclge_cmd_setup_basic_desc(, HCLGE_OPC_MAC_ETHTYPE_ADD, false); + memcpy(desc.data, req, sizeof(struct hclge_mac_mgr_tbl_entry_cmd)); + + ret = hclge_cmd_send(>hw, , 1); + if (ret) { + dev_err(>pdev->dev, + "add mac ethertype failed for cmd_send, ret =%d.\n", + ret); + return ret; + } + + resp_code = (le32_to_cpu(desc.data[0]) >> 8) & 0xff; + retval = le16_to_cpu(desc.retval); + + return hclge_get_mac_ethertype_cmd_status(hdev, retval, resp_code);
[PATCH V2 net-next 0/4] add some features to hns3 driver
This patchset adds some features to hns3 driver, include the support for ethtool command -d, -p and support for manager table. [Patch 1/4] adds support for ethtool command -d, its ops is get_regs. driver will send command to command queue, and get regs number and regs value from command queue. [Patch 2/4] adds manager table initialization for hardware. [Patch 3/4] adds support for ethtool command -p. For fiber ports, driver sends command to command queue, and IMP will write SGPIO regs to control leds. [Patch 4/4] adds support for net status led for fiber ports. Net status include port speed, total rx/tx packets and link status. Driver send the status to command queue, and IMP will write SGPIO to control leds. --- Change log: V1 -> V2: 1, fix comments from Andrew Lunn, remove the patch "net: hns3: add ethtool -p support for phy device". --- Fuyun Liang (2): net: hns3: add support for get_regs net: hns3: add manager table initialization for hardware Jian Shen (2): net: hns3: add ethtool -p support for fiber port net: hns3: add net status led support for fiber port drivers/net/ethernet/hisilicon/hns3/hnae3.h| 5 +- drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c | 35 ++ .../net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h | 47 +++ .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c| 456 + .../ethernet/hisilicon/hns3/hns3pf/hclge_main.h| 3 + 5 files changed, 545 insertions(+), 1 deletion(-) -- 2.9.3
[PATCH V2 net-next 4/4] net: hns3: add net status led support for fiber port
From: Jian ShenCheck the net status per second, include port speed, total rx/tx packets and link status. Updating the led status for fiber port. Signed-off-by: Jian Shen Signed-off-by: Peng Li --- .../net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h | 1 + .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c| 109 + .../ethernet/hisilicon/hns3/hns3pf/hclge_main.h| 3 + 3 files changed, 113 insertions(+) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h index 122f862..3fd10a6 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h @@ -115,6 +115,7 @@ enum hclge_opcode_type { HCLGE_OPC_QUERY_LINK_STATUS = 0x0307, HCLGE_OPC_CONFIG_MAX_FRM_SIZE = 0x0308, HCLGE_OPC_CONFIG_SPEED_DUP = 0x0309, + HCLGE_OPC_STATS_MAC_TRAFFIC = 0x0314, /* MACSEC command */ /* PFC/Pause CMD*/ diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c index 12150f2..32bc6f6 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c @@ -39,6 +39,7 @@ static int hclge_set_mta_filter_mode(struct hclge_dev *hdev, static int hclge_set_mtu(struct hnae3_handle *handle, int new_mtu); static int hclge_init_vlan_config(struct hclge_dev *hdev); static int hclge_reset_ae_dev(struct hnae3_ae_dev *ae_dev); +static int hclge_update_led_status(struct hclge_dev *hdev); static struct hnae3_ae_algo ae_algo; @@ -505,6 +506,38 @@ static int hclge_32_bit_update_stats(struct hclge_dev *hdev) return 0; } +static int hclge_mac_get_traffic_stats(struct hclge_dev *hdev) +{ + struct hclge_mac_stats *mac_stats = >hw_stats.mac_stats; + struct hclge_desc desc; + __le64 *desc_data; + int ret; + + /* for fiber port, need to query the total rx/tx packets statstics, +* used for data transferring checking. +*/ + if (hdev->hw.mac.media_type != HNAE3_MEDIA_TYPE_FIBER) + return 0; + + if (test_bit(HCLGE_STATE_STATISTICS_UPDATING, >state)) + return 0; + + hclge_cmd_setup_basic_desc(, HCLGE_OPC_STATS_MAC_TRAFFIC, true); + ret = hclge_cmd_send(>hw, , 1); + if (ret) { + dev_err(>pdev->dev, + "Get MAC total pkt stats fail, ret = %d\n", ret); + + return ret; + } + + desc_data = (__le64 *)([0]); + mac_stats->mac_tx_total_pkt_num += le64_to_cpu(*desc_data++); + mac_stats->mac_rx_total_pkt_num += le64_to_cpu(*desc_data); + + return 0; +} + static int hclge_mac_update_stats(struct hclge_dev *hdev) { #define HCLGE_MAC_CMD_NUM 21 @@ -2846,13 +2879,20 @@ static void hclge_service_task(struct work_struct *work) struct hclge_dev *hdev = container_of(work, struct hclge_dev, service_task); + /* The total rx/tx packets statstics are wanted to be updated +* per second. Both hclge_update_stats_for_all() and +* hclge_mac_get_traffic_stats() can do it. +*/ if (hdev->hw_stats.stats_timer >= HCLGE_STATS_TIMER_INTERVAL) { hclge_update_stats_for_all(hdev); hdev->hw_stats.stats_timer = 0; + } else { + hclge_mac_get_traffic_stats(hdev); } hclge_update_speed_duplex(hdev); hclge_update_link_status(hdev); + hclge_update_led_status(hdev); hclge_service_complete(hdev); } @@ -5888,6 +5928,75 @@ static int hclge_set_led_id(struct hnae3_handle *handle, return ret; } +enum hclge_led_port_speed { + HCLGE_SPEED_LED_FOR_1G, + HCLGE_SPEED_LED_FOR_10G, + HCLGE_SPEED_LED_FOR_25G, + HCLGE_SPEED_LED_FOR_40G, + HCLGE_SPEED_LED_FOR_50G, + HCLGE_SPEED_LED_FOR_100G, +}; + +static u8 hclge_led_get_speed_status(u32 speed) +{ + u8 speed_led; + + switch (speed) { + case HCLGE_MAC_SPEED_1G: + speed_led = HCLGE_SPEED_LED_FOR_1G; + break; + case HCLGE_MAC_SPEED_10G: + speed_led = HCLGE_SPEED_LED_FOR_10G; + break; + case HCLGE_MAC_SPEED_25G: + speed_led = HCLGE_SPEED_LED_FOR_25G; + break; + case HCLGE_MAC_SPEED_40G: + speed_led = HCLGE_SPEED_LED_FOR_40G; + break; + case HCLGE_MAC_SPEED_50G: + speed_led = HCLGE_SPEED_LED_FOR_50G; + break; + case HCLGE_MAC_SPEED_100G: + speed_led = HCLGE_SPEED_LED_FOR_100G; + break; + default: + speed_led = HCLGE_LED_NO_CHANGE; + } + + return speed_led; +} + +static int hclge_update_led_status(struct
[PATCH V2 net-next 3/4] net: hns3: add ethtool -p support for fiber port
From: Jian ShenAdd led location support for fiber port. The led will keep blinking when locating. Signed-off-by: Jian Shen Signed-off-by: Peng Li --- drivers/net/ethernet/hisilicon/hns3/hnae3.h| 2 + drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c | 12 .../net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h | 20 +++ .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c| 70 ++ 4 files changed, 104 insertions(+) diff --git a/drivers/net/ethernet/hisilicon/hns3/hnae3.h b/drivers/net/ethernet/hisilicon/hns3/hnae3.h index d104ce5..fd06bc7 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hnae3.h +++ b/drivers/net/ethernet/hisilicon/hns3/hnae3.h @@ -405,6 +405,8 @@ struct hnae3_ae_ops { int (*set_channels)(struct hnae3_handle *handle, u32 new_tqps_num); void (*get_flowctrl_adv)(struct hnae3_handle *handle, u32 *flowctrl_adv); + int (*set_led_id)(struct hnae3_handle *handle, + enum ethtool_phys_id_state status); }; struct hnae3_dcb_ops { diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c b/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c index 1c8b293..7410205 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c @@ -1084,6 +1084,17 @@ static void hns3_get_regs(struct net_device *netdev, h->ae_algo->ops->get_regs(h, >version, data); } +static int hns3_set_phys_id(struct net_device *netdev, + enum ethtool_phys_id_state state) +{ + struct hnae3_handle *h = hns3_get_handle(netdev); + + if (!h->ae_algo || !h->ae_algo->ops || !h->ae_algo->ops->set_led_id) + return -EOPNOTSUPP; + + return h->ae_algo->ops->set_led_id(h, state); +} + static const struct ethtool_ops hns3vf_ethtool_ops = { .get_drvinfo = hns3_get_drvinfo, .get_ringparam = hns3_get_ringparam, @@ -1126,6 +1137,7 @@ static const struct ethtool_ops hns3_ethtool_ops = { .set_coalesce = hns3_set_coalesce, .get_regs_len = hns3_get_regs_len, .get_regs = hns3_get_regs, + .set_phys_id = hns3_set_phys_id, }; void hns3_ethtool_set_ops(struct net_device *netdev) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h index 1cd28e0..122f862 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h @@ -227,6 +227,9 @@ enum hclge_opcode_type { /* Mailbox cmd */ HCLGEVF_OPC_MBX_PF_TO_VF= 0x2000, + + /* Led command */ + HCLGE_OPC_LED_STATUS_CFG= 0xB000, }; #define HCLGE_TQP_REG_OFFSET 0x8 @@ -807,6 +810,23 @@ struct hclge_reset_cmd { #define HCLGE_NIC_CMQ_DESC_NUM 1024 #define HCLGE_NIC_CMQ_DESC_NUM_S 3 +#define HCLGE_LED_PORT_SPEED_STATE_S 0 +#define HCLGE_LED_PORT_SPEED_STATE_M GENMASK(5, 0) +#define HCLGE_LED_ACTIVITY_STATE_S 0 +#define HCLGE_LED_ACTIVITY_STATE_M GENMASK(1, 0) +#define HCLGE_LED_LINK_STATE_S 0 +#define HCLGE_LED_LINK_STATE_M GENMASK(1, 0) +#define HCLGE_LED_LOCATE_STATE_S 0 +#define HCLGE_LED_LOCATE_STATE_M GENMASK(1, 0) + +struct hclge_set_led_state_cmd { + u8 port_speed_led_config; + u8 link_led_config; + u8 activity_led_config; + u8 locate_led_config; + u8 rsv[20]; +}; + int hclge_cmd_init(struct hclge_dev *hdev); static inline void hclge_write_reg(void __iomem *base, u32 reg, u32 value) { diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c index 6e64bed..12150f2 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c @@ -5819,6 +5819,75 @@ static void hclge_get_regs(struct hnae3_handle *handle, u32 *version, "Get 64 bit register failed, ret = %d.\n", ret); } +static int hclge_set_led_status_sfp(struct hclge_dev *hdev, u8 speed_led_status, + u8 act_led_status, u8 link_led_status, + u8 locate_led_status) +{ + struct hclge_set_led_state_cmd *req; + struct hclge_desc desc; + int ret; + + hclge_cmd_setup_basic_desc(, HCLGE_OPC_LED_STATUS_CFG, false); + + req = (struct hclge_set_led_state_cmd *)desc.data; + hnae_set_field(req->port_speed_led_config, HCLGE_LED_PORT_SPEED_STATE_M, + HCLGE_LED_PORT_SPEED_STATE_S, speed_led_status); + hnae_set_field(req->link_led_config, HCLGE_LED_ACTIVITY_STATE_M, + HCLGE_LED_ACTIVITY_STATE_S, act_led_status); + hnae_set_field(req->activity_led_config, HCLGE_LED_LINK_STATE_M, +
Re: [PATCH 0/3] Check gso_size of packets when forwarding
Daniel Axtenswrites: > Pravin Shelar writes: > >> On Thu, Jan 18, 2018 at 5:08 AM, Daniel Axtens wrote: >>> Pravin Shelar writes: >>> On Mon, Jan 15, 2018 at 6:09 PM, Daniel Axtens wrote: > When regular packets are forwarded, we validate their size against the > MTU of the destination device. However, when GSO packets are > forwarded, we do not validate their size against the MTU. We > implicitly assume that when they are segmented, the resultant packets > will be correctly sized. > > This is not always the case. > > We observed a case where a packet received on an ibmveth device had a > GSO size of around 10kB. This was forwarded by Open vSwitch to a bnx2x > device, where it caused a firmware assert. This is described in detail > at [0] and was the genesis of this series. Rather than fixing it in > the driver, this series fixes the forwarding path. > Are there any other possible forwarding path in networking stack? TC is one subsystem that could forward such a packet to the bnx2x device, how is that handled ? >>> >>> So far I have only looked at bridges, openvswitch and macvlan. In >>> general, if the code uses dev_forward_skb() it should automatically be >>> fine as that invokes is_skb_forwardable(), which we patch. >>> >> But there are other ways to forward packets, e.g tc-mirred or bpf >> redirect. We need to handle all these cases rather than fixing one at >> a time. As Jason suggested netif_needs_gso() looks like good function >> to validate if a device is capable of handling given GSO packet. > > I am not entirely sure this is a better solution. > > The biggest reason I am uncomfortable with this is that if > netif_needs_gso() returns true, the skb will be segmented. The segment > sizes will be based on gso_size. Since gso_size is greater than the MTU, > the resulting segments will themselves be over-MTU. Those over-MTU > segments will then be passed to the network card. I think we should not > be creating over-MTU segments; we should instead be dropping the packet > and logging. > > I do take the point that you and Jason are making: a more generic > fix would be good. I'm just not sure where to put it. How about this? It puts the check in validate_xmit_skb(). (completely untested) Ultimately I think we need to make a broader policy decision about who is responsible for ensuring that packets are correctly sized - is it the caller of dev_queue_xmit (as in bridges and openswitch) or is it lower down the stack, such as validate_xmit_skb()? Making the caller check is more efficient - packets created on the system are always going to fit due to the checks in the protocol code, so rechecking is inefficient and unnecessary. But checking later is more reliable as we don't have to identify all forwarding paths. Regards, Daniel diff --git a/net/core/dev.c b/net/core/dev.c index 5af04be128c1..b8c3f33ece19 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -1830,13 +1830,11 @@ static inline void net_timestamp_set(struct sk_buff *skb) __net_timestamp(SKB); \ } \ -bool is_skb_forwardable(const struct net_device *dev, const struct sk_buff *skb) +static inline bool does_skb_fit_mtu(const struct net_device *dev, + const struct sk_buff *skb) { unsigned int len; - if (!(dev->flags & IFF_UP)) - return false; - len = dev->mtu + dev->hard_header_len + VLAN_HLEN; if (skb->len <= len) return true; @@ -1850,6 +1848,14 @@ bool is_skb_forwardable(const struct net_device *dev, const struct sk_buff *skb) return false; } + +bool is_skb_forwardable(const struct net_device *dev, const struct sk_buff *skb) +{ + if (!(dev->flags & IFF_UP)) + return false; + + return does_skb_fit_mtu(dev, skb); +} EXPORT_SYMBOL_GPL(is_skb_forwardable); int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb) @@ -3081,6 +3087,9 @@ static struct sk_buff *validate_xmit_skb(struct sk_buff *skb, struct net_device if (unlikely(!skb)) goto out_null; + if (unlikely(!does_skb_fit_mtu(dev, skb))) + goto out_kfree_skb; + if (netif_needs_gso(skb, features)) { struct sk_buff *segs; > > > Regards, > Daniel
Re: [RFC PATCH] e1000e: Remove Other from EIAC.
On 2018/01/18 18:42, Shrikrishna Khare wrote: > > > On Thu, 18 Jan 2018, Benjamin Poirier wrote: > > > On 2018/01/18 15:50, Benjamin Poirier wrote: > > > It was reported that emulated e1000e devices in vmware esxi 6.5 Build > > > 7526125 do not link up after commit 4aea7a5c5e94 ("e1000e: Avoid receiver > > > overrun interrupt bursts", v4.15-rc1). Some tracing shows that after > > > e1000e_trigger_lsc() is called, ICR reads out as 0x0 in e1000_msix_other() > > > on emulated e1000e devices. In comparison, on real e1000e 82574 hardware, > > > icr=0x8004 (_INT_ASSERTED | _OTHER) in the same situation. > > > > > > Some experimentation showed that this flaw in vmware e1000e emulation can > > > be worked around by not setting Other in EIAC. This is how it was before > > > 16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt", v4.5-rc1). > > > > vmware folks, please comment. > > Thank you for bringing this to our attention. > > Using the reported build (ESX 6.5, 7526125) and 4.15.0-rc8+ kernel (which > has the said patch), I could bring up e1000e interface (version: 3.2.6-k), > get dhcp address and even do large file downloads without difficulty. > > Could you give us more pointers on how we may be able to reproduce this > locally? Was there anything different with the configuration when the > issue was observed? Is the issue consistently reproducible? It's consistently reproducible, however I noticed that once in a while there is a genuine "Other" interrupt that comes in and triggers the link status change. The problem is with interrupts that are triggered via a write to ICS (such as in e1000e_trigger_lsc()). Can you reproduce a problem if you do: ip link set ethX down ip link set ethX up If you're building your own kernel, you can add the following patch and cat /sys/kernel/debug/tracing/trace_pipe For me it shows on v4.15-rc8: <...>-2578 [000] 83527.938321: e1000e_trigger_lsc: trigger_lsc <...>-2578 [000] d.h. 83527.938398: e1000_msix_other: icr 0x0 With the patch that I submitted, it shows: wickedd-1329 [002] .N..20.123545: e1000e_trigger_lsc: trigger_lsc -0 [000] d.h.20.123630: e1000_msix_other: icr 0x8104 -0 [000] d.h.20.123654: e1000_msix_other: lsc -0 [000] d.h.20.123676: e1000_msix_other: mod_timer diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c index 9f18d39bdc8f..16620ce840fc 100644 --- a/drivers/net/ethernet/intel/e1000e/netdev.c +++ b/drivers/net/ethernet/intel/e1000e/netdev.c @@ -1918,22 +1918,29 @@ static irqreturn_t e1000_msix_other(int __always_unused irq, void *data) bool enable = true; icr = er32(ICR); + trace_printk("icr 0x%x\n", icr); + if (icr & E1000_ICR_RXO) { + trace_printk("rxo\n"); ew32(ICR, E1000_ICR_RXO); enable = false; /* napi poll will re-enable Other, make sure it runs */ if (napi_schedule_prep(>napi)) { + trace_printk("napi schedule\n"); adapter->total_rx_bytes = 0; adapter->total_rx_packets = 0; __napi_schedule(>napi); } } if (icr & E1000_ICR_LSC) { + trace_printk("lsc\n"); ew32(ICR, E1000_ICR_LSC); hw->mac.get_link_status = true; /* guard against interrupt when we're going down */ - if (!test_bit(__E1000_DOWN, >state)) + if (!test_bit(__E1000_DOWN, >state)) { + trace_printk("mod_timer\n"); mod_timer(>watchdog_timer, jiffies + 1); + } } if (enable && !test_bit(__E1000_DOWN, >state)) @@ -4221,6 +4228,8 @@ static void e1000e_trigger_lsc(struct e1000_adapter *adapter) { struct e1000_hw *hw = >hw; + trace_printk("trigger_lsc\n"); + if (adapter->msix_entries) ew32(ICS, E1000_ICS_LSC | E1000_ICS_OTHER); else
Re: [PATCH iproute2] devlink: Ignore unknown attributes
On 1/17/18 5:28 AM, Arkadi Sharshevsky wrote: > In case of extending the UAPI old packages would break. > > Signed-off-by: Arkadi Sharshevsky> --- > devlink/devlink.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/devlink/devlink.c b/devlink/devlink.c > index 39cda06..c9d1838 100644 > --- a/devlink/devlink.c > +++ b/devlink/devlink.c > @@ -343,7 +343,7 @@ static int attr_cb(const struct nlattr *attr, void *data) > int type; > > if (mnl_attr_type_valid(attr, DEVLINK_ATTR_MAX) < 0) > - return MNL_CB_ERROR; > + return MNL_CB_OK; > > type = mnl_attr_get_type(attr); > if (mnl_attr_validate(attr, devlink_policy[type]) < 0) > What's the point of calling mnl_attr_type_valid if you disregard a failure? you might as well not call mnl_attr_type_valid at all.
Re: iscsi target regression due to "tcp: remove prequeue support" patch
On 01/18/2018 09:10 AM, Florian Westphal wrote: >> It would indicate users providing their own ->sk_data_ready() callback >> must be responsible for waking up a kthread context blocked on >> sock_recvmsg(..., MSG_WAITALL), when a second ->sk_data_ready() is >> received before the first sock_recvmsg(..., MSG_WAITALL) completes. > > I agree, it looks like we need something like this? > (not even build tested): > > diff --git a/drivers/target/iscsi/iscsi_target_nego.c > b/drivers/target/iscsi/iscsi_target_nego.c > index b686e2ce9c0e..3723f8f419aa 100644 > --- a/drivers/target/iscsi/iscsi_target_nego.c > +++ b/drivers/target/iscsi/iscsi_target_nego.c > @@ -432,6 +432,9 @@ static void iscsi_target_sk_data_ready(struct sock *sk) > if (test_and_set_bit(LOGIN_FLAGS_READ_ACTIVE, >login_flags)) { > write_unlock_bh(>sk_callback_lock); > pr_debug("Got LOGIN_FLAGS_READ_ACTIVE=1, conn: %p \n", > conn); > + if (WARN_ON(iscsi_target_sk_data_ready == > conn->orig_data_ready)) > + return; > + conn->orig_data_ready(sk); > return; This allows iscsi login to work for me. I ran it against the target-pending for-next branch.
[PATCH net] net: qdisc_pkt_len_init() should be more robust
From: Eric DumazetWithout proper validation of DODGY packets, we might very well feed qdisc_pkt_len_init() with invalid GSO packets. tcp_hdrlen() might access out-of-bound data, so let's use skb_header_pointer() and proper checks. Whole story is described in commit d0c081b49137 ("flow_dissector: properly cap thoff field") We have the goal of validating DODGY packets earlier in the stack, so we might very well revert this fix in the future. Signed-off-by: Eric Dumazet Cc: Willem de Bruijn Cc: Jason Wang Reported-by: syzbot+9da69ebac7804...@syzkaller.appspotmail.com --- net/core/dev.c | 19 +++ 1 file changed, 15 insertions(+), 4 deletions(-) diff --git a/net/core/dev.c b/net/core/dev.c index 0e0ba36eeac9852b8df5ddd398dbc66ad6c83760..613fb4066be7bedbd3cd59ea9cf6eb7ef3100bd0 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -3151,10 +3151,21 @@ static void qdisc_pkt_len_init(struct sk_buff *skb) hdr_len = skb_transport_header(skb) - skb_mac_header(skb); /* + transport layer */ - if (likely(shinfo->gso_type & (SKB_GSO_TCPV4 | SKB_GSO_TCPV6))) - hdr_len += tcp_hdrlen(skb); - else - hdr_len += sizeof(struct udphdr); + if (likely(shinfo->gso_type & (SKB_GSO_TCPV4 | SKB_GSO_TCPV6))) { + const struct tcphdr *th; + struct tcphdr _tcphdr; + + th = skb_header_pointer(skb, skb_transport_offset(skb), + sizeof(_tcphdr), &_tcphdr); + if (likely(th)) + hdr_len += __tcp_hdrlen(th); + } else { + struct udphdr _udphdr; + + if (skb_header_pointer(skb, skb_transport_offset(skb), + sizeof(_udphdr), &_udphdr)) + hdr_len += sizeof(struct udphdr); + } if (shinfo->gso_type & SKB_GSO_DODGY) gso_segs = DIV_ROUND_UP(skb->len - hdr_len,
Re: [PATCH-next] flow_netlink: Remove unneeded semicolons
El jue, 18-01-2018 a las 16:25 -0500, David Miller escribió: > From: Christopher Díaz Riveros> Date: Wed, 17 Jan 2018 16:10:28 -0500 > > > Trivial fix removes unneeded semicolons after if blocks. > > > > This issue was detected by using the Coccinelle software. > > > > Signed-off-by: Christopher Díaz Riveros > > Applied, thanks. Thank you David, It's actually my first patch accepted in the kernel :) sorry if it's a very tiny trivial one, but I'll keep learning and sending more meaningful ones in the future. Regards, -- Christopher Díaz Riveros Gentoo Linux Developer GPG Fingerprint: E517 5ECB 8152 98E4 FEBC 2BAA 4DBB D10F 0FDD 2547 signature.asc Description: This is a digitally signed message part
Re: [RFC][PATCH] get rid of the use of set_fs() (by way of kernel_recvmsg()) in sunrpc
On Thu, Jan 18, 2018 at 07:31:56PM +, Al Viro wrote: > * SIOCADDRT/SIOCDELRT in compat ioctls To bring back a question I'd asked back in October - what do we do about SIOC...RT compat? To recap: * AF_INET sockets expect struct rtentry; it differs between 32bit and 64bit, so routing_ioctl() in net/socket.c is called from compat_sock_ioctl_trans() and does the right thing. All proto_ops instances with .family = PF_INET (and only they) have inet_ioctl() as ->ioctl(), and end up with ip_rt_ioctl() called for native ones. Three of those have ->compat_ioctl() set to inet_compat_ioctl(), the rest have it NULL. In any case, inet_compat_ioctl() ignores those, leaving them to compat_sock_ioctl_trans() to pick up. * for AF_INET6 the situation is similar, except that they use struct in6_rtmsg. Compat is also dealt with in routing_ioctl(). inet6_ioctl() for all such proto_ops (and only those), ipv6_route_ioctl() is what ends up handling the native ones. No ->compat_ioctl() in any of those. * AF_PACKET sockets expect struct rt_entry and actually bounce the native calls to inet_ioctl(). No ->compat_ioctl() there, but routing_ioctl() in net/socket.c does the right thing. * AF_APPLETALK sockets expect struct rt_entry. Native handled in atrtr_ioctl(); there is ->compat_ioctl(), but it ignores those ioctls, so we go through the conversion in net/socket.c. Also happens to work correctly. * ax25, ipx, netrom, rose and x25 use structures of their own, and those structures have identical layouts on 32bit and 64bit. x25 has ->compat_ioctl() that does the right thing (bounces to native), the rest either have ->compat_ioctl() ignoring those ioctls (ipx) or do not have ->compat_ioctl() at all. That ends up with generic code picking those and buggering them up - routing_ioctl() assumes that we want either in6_rtmsg (ipv6) or rtentry (everything else). Unfortunately, in case of these protocols we should just leave the suckers alone. Back then Ralf has verified that the bug exists and said he'd put together a fix. Looks like that fix has fallen through the cracks, though. * all other protocols fail those; usually with ENOTTY, except for AF_QIPCRTR that fails with EINVAL. Either way, compat is not an issue. Note that handling of SIOCADDRT on e.g. raw ipv4 sockets from 32bit process is convoluted as hell. The call chain is compat_sys_ioctl() compat_sock_ioctl() inet_compat_ioctl() compat_raw_ioctl() => -ENOIOCTLCMD, possibly by way of ipmr_compat_ioctl() compat_sock_ioctl_trans() routing_ioctl() [conversion done here] sock_do_ioctl() inet_ioctl() ip_rt_ioctl() A lot of those are method calls, BTW, and the overhead on those has just grown... Does anybody have objections against the following? 1) Somewhere in net/core (or net/compat.c, for that matter) add int compat_get_rtentry(struct rtentry *r, struct rtentry32 __user *p); 2) In inet_compat_ioctl() recognize SIOC{ADD,DEL}RT and do err = compat_get_rtentry(, (void __user *)arg); if (!err) err = ip_rt_ioctl(...) return err; 3) Add inet_compat_ioctl() as ->compat_ioctl in all PF_INET proto_ops. 4) Lift copyin from atrtr_ioctl() to atalk_ioctl(), teach atalk_compat_ioctl() about these ioctls (using compat_get_rtentry() and atrtr_ioctl(), that is). 5) Add ->compat_ioctl() to AF_PACKET, let it just call inet_compat_ioctl() for those two. 6) Lift copyin from ipv6_route_ioctl() to inet6_ioctl(). Add inet6_compat_ioctl() that would recognize those two, do compat copyin and call ipv6_route_ioctl(). Make it ->compat_ioctl for all PF_INET6 proto_ops. 7) Tell compat_sock_ioctl_trans() to move these two into the "just call sock_do_ioctl()" group of cases. Or, with Ralf's fix, just remove these two cases from compat_sock_ioctl_trans() completely. Either way, routing_ioctl() becomes dead code and can be removed.
Re: [RFC PATCH] e1000e: Remove Other from EIAC.
On Thu, 18 Jan 2018, Benjamin Poirier wrote: > On 2018/01/18 15:50, Benjamin Poirier wrote: > > It was reported that emulated e1000e devices in vmware esxi 6.5 Build > > 7526125 do not link up after commit 4aea7a5c5e94 ("e1000e: Avoid receiver > > overrun interrupt bursts", v4.15-rc1). Some tracing shows that after > > e1000e_trigger_lsc() is called, ICR reads out as 0x0 in e1000_msix_other() > > on emulated e1000e devices. In comparison, on real e1000e 82574 hardware, > > icr=0x8004 (_INT_ASSERTED | _OTHER) in the same situation. > > > > Some experimentation showed that this flaw in vmware e1000e emulation can > > be worked around by not setting Other in EIAC. This is how it was before > > 16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt", v4.5-rc1). > > vmware folks, please comment. Thank you for bringing this to our attention. Using the reported build (ESX 6.5, 7526125) and 4.15.0-rc8+ kernel (which has the said patch), I could bring up e1000e interface (version: 3.2.6-k), get dhcp address and even do large file downloads without difficulty. Could you give us more pointers on how we may be able to reproduce this locally? Was there anything different with the configuration when the issue was observed? Is the issue consistently reproducible? Thanks, Shri
Re: [PATCH net-next 3/5] net: hns3: add ethtool -p support for phy device
On 2018/1/18 22:25, Andrew Lunn wrote: +static int hclge_set_led_status_phy(struct phy_device *phydev, int value) +{ + int ret, cur_page; + + mutex_lock(>lock); + + ret = phy_read(phydev, HCLGE_PHY_PAGE_REG); + if (ret < 0) + goto out; + else + cur_page = ret; + + ret = phy_write(phydev, HCLGE_PHY_PAGE_REG, HCLGE_PHY_PAGE_LED); + if (ret) + goto out; + + ret = phy_write(phydev, HCLGE_LED_FC_REG, value); + if (ret) + goto out; + + ret = phy_write(phydev, HCLGE_PHY_PAGE_REG, cur_page); + +out: + mutex_unlock(>lock); + return ret; +} Sorry, but NACK. Please add an interface to phylib and the phy driver you are using to do this. #define HCLGE_PHY_PAGE_MDIX 0 #define HCLGE_PHY_PAGE_COPPER 0 +#define HCLGE_PHY_PAGE_LED 3 /* Page Selection Reg. */ #define HCLGE_PHY_PAGE_REG22 @@ -73,6 +74,15 @@ /* Copper Specific Status Register */ #define HCLGE_PHY_CSS_REG 17 +/* LED Function Control Register */ +#define HCLGE_LED_FC_REG 16 + +/* LED Polarity Control Register */ +#define HCLGE_LED_PC_REG 17 + +#define HCLGE_LED_FORCE_ON 9 +#define HCLGE_LED_FORCE_OFF8 + By the looks of these defines, you assume you have a Marvell PHY. Please make this generic so anybody with a Marvell PHY can use it. Andrew Hi Andrw, As your suggestion, we need add interface to phylib and the phy driver. We will consider your suggestion and push this patch after we fix your comments. so we will remove this patch in V2 patch-set. Thanks Peng Li .
[GIT] Networking
1) Fix BPF divides by zero, from Eric Dumazet and Alexei Starovoitov. 2) Reject stores into bpf context via st and xadd, from Daniel Borkmann. 3) Fix a memory leak in TUN, from Cong Wang. 4) Disable RX aggregation on a specific troublesome configuration of r8152 in a Dell TB16b dock. 5) Fix sw_ctx leak in tls, from Sabrina Dubroca. 6) Fix program replacement in cls_bpf, from Daniel Borkmann. 7) Fix uninitialized station_info structures in cfg80211, from Johannes Berg. 8) Fix miscalculation of transport header offset field in flow dissector, from Eric Dumazet. 9) Fix LPM tree leak on failure in mlxsw driver, from Ido Schimmel. Please pull, thanks a lot! The following changes since commit 8cbab92dff778e516064c13113ca15d4869ec883: Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma (2018-01-16 16:47:40 -0800) are available in the Git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git for you to fetch changes up to a0dca10fce42ae82651edbe682b1c637a8ecd365: ibmvnic: Fix IPv6 packet descriptors (2018-01-18 21:19:06 -0500) Alexei Starovoitov (1): bpf: fix 32-bit divide by zero Alexey Kodanev (1): ip6_gre: init dev->mtu and dev->hard_header_len correctly Arnd Bergmann (1): fm10k: mark PM functions as __maybe_unused Christophe Leroy (1): net: fs_enet: do not call phy_stop() in interrupts Cong Wang (1): tun: fix a memory leak for tfile->tx_array Daniel Borkmann (4): bpf, arm64: fix stack_depth tracking in combination with tail calls bpf: reject stores into ctx via st and xadd bpf: fix cls_bpf on filter replace bpf: mark dst unknown on inconsistent {s, u}bounds adjustments David S. Miller (4): Merge tag 'linux-can-fixes-for-4.15-20180116' of ssh://gitolite.kernel.org/.../mkl/linux-can Merge git://git.kernel.org/.../bpf/bpf Merge tag 'wireless-drivers-for-davem-2018-01-17' of git://git.kernel.org/.../kvalo/wireless-drivers Merge tag 'linux-can-fixes-for-4.15-20180118' of ssh://gitolite.kernel.org/.../mkl/linux-can Eric Dumazet (2): bpf: fix divides by zero flow_dissector: properly cap thoff field Guenter Roeck (1): bcma: Fix 'allmodconfig' and BCMA builds on MIPS targets Ido Schimmel (1): mlxsw: spectrum_router: Free LPM tree upon failure Ilya Lesokhin (1): net/tls: Only attach to sockets in ESTABLISHED state James Hogan (1): ssb: Disable PCI host for PCI_DRIVERS_GENERIC Johannes Berg (1): cfg80211: fix station info handling bugs Kai-Heng Feng (1): r8152: disable RX aggregation on Dell TB16 dock Marc Kleine-Budde (2): can: af_can: can_rcv(): replace WARN_ONCE by pr_warn_once can: af_can: canfd_rcv(): replace WARN_ONCE by pr_warn_once Rex Chang (1): Net: ethernet: ti: netcp: Fix inbound ping crash if MTU size is greater than 1500 Sabrina Dubroca (3): tls: fix sw_ctx leak tls: return -EBUSY if crypto_info is already set tls: reset crypto_info when do_tls_setsockopt_tx fails Stephane Grosjean (1): can: peak: fix potential bug in packet fragmentation Thomas Falcon (2): ibmvnic: Fix IP offload control buffer ibmvnic: Fix IPv6 packet descriptors Wei Wang (1): ipv6: don't let tb6_root node share routes with other node Wright Feng (1): brcmfmac: fix CLM load error for legacy chips when user helper is enabled Xin Long (1): netlink: reset extack earlier in netlink_rcv_skb arch/arm64/net/bpf_jit_comp.c | 20 ++- drivers/bcma/Kconfig | 2 +- drivers/net/can/usb/peak_usb/pcan_usb_fd.c| 21 +-- drivers/net/ethernet/freescale/fs_enet/fs_enet-main.c | 16 +++-- drivers/net/ethernet/freescale/fs_enet/fs_enet.h | 1 + drivers/net/ethernet/ibm/ibmvnic.c| 24 - drivers/net/ethernet/intel/fm10k/fm10k_pci.c | 9 ++--- drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c | 20 +++ drivers/net/ethernet/ti/netcp_core.c | 2 +- drivers/net/tun.c | 15 ++-- drivers/net/usb/r8152.c | 13 +++ drivers/net/wireless/broadcom/brcm80211/brcmfmac/common.c | 9 ++--- drivers/ssb/Kconfig | 2 +- kernel/bpf/core.c | 4 +-- kernel/bpf/verifier.c | 64 +++-- net/can/af_can.c | 36 --- net/core/filter.c | 4 +++ net/core/flow_dissector.c | 3 +- net/ipv6/ip6_fib.c
Re: [PATCH net v2] ibmvnic: Fix IPv6 packet descriptors
From: Thomas FalconDate: Thu, 18 Jan 2018 19:29:48 -0600 > Packet descriptor generation for IPv6 is broken. > Properly set L3 and L4 protocol flags for IPv6 descriptors. > > Signed-off-by: Thomas Falcon > --- > Changes in v2: > -keep declarations in reverse xmas tree format > -Fix whitespace errors Also applied, thank you.
Re: [PATCH net] ibmvnic: Fix IP offload control buffer
From: Thomas FalconDate: Thu, 18 Jan 2018 19:05:01 -0600 > Set some missing fields in the IP control offload buffer. This buffer is > used to enable checksum and TCP segmentation offload in the VNIC server. > The buffer length field and the checksum offloading bits were not set > properly, so fix that here. > > Signed-off-by: Thomas Falcon Applied.
Re: pull-request: can 2018-01-18
From: Marc Kleine-BuddeDate: Thu, 18 Jan 2018 11:15:05 +0100 > this is a pull reqeust of two patches for net/master: > > The syzkaller project triggered two WARN_ONCE() in the af_can code from > userspace and we decided to replace it by a pr_warn_once(). Pulled, thanks Marc.
Re: [PATCH net] ipv6: don't let tb6_root node share routes with other node
From: Wei WangDate: Thu, 18 Jan 2018 10:40:03 -0800 > From: Wei Wang > > After commit 4512c43eac7e, if we add a route to the subtree of tb6_root > which does not have any route attached to it yet, the current code will > let tb6_root and the node in the subtree share the same route. > This could cause problem cause tb6_root has RTN_INFO flag marked and the > tree repair and clean up code will not work properly. > This commit makes sure tb6_root->leaf points back to null_entry instead > of sharing route with other node. > > It fixes the following syzkaller reported issue: ... > Fixes: 4512c43eac7e ("ipv6: remove null_entry before adding default route") > Signed-off-by: Wei Wang > Acked-by: Eric Dumazet Applied, thank you.
Re: [patch 1/1] net/sched/sch_prio.c: work around gcc-4.4.4 union initializer issues
From: a...@linux-foundation.org Date: Thu, 18 Jan 2018 16:30:49 -0800 > From: Andrew Morton> Subject: net/sched/sch_prio.c: work around gcc-4.4.4 union initializer issues > > gcc-4.4.4 has problems witn anon union initializers. Work around this. > > net/sched/sch_prio.c: In function 'prio_dump_offload': > net/sched/sch_prio.c:260: error: unknown field 'stats' specified in > initializer > net/sched/sch_prio.c:260: warning: initialization makes integer from pointer > without a cast > net/sched/sch_prio.c:261: error: unknown field 'stats' specified in > initializer > net/sched/sch_prio.c:261: warning: initialization makes integer from pointer > without a cast > > Fixes: 7fdb61b44c0c95 ("net: sch: prio: Add offload ability to PRIO qdisc") > Cc: Nogah Frankel > Cc: Yuval Mintz > Cc: Jiri Pirko > Cc: David S. Miller > Signed-off-by: Andrew Morton Applied, thanks Andrew.
Re: [PATCH net v2] ip6_gre: init dev->mtu and dev->hard_header_len correctly
From: Alexey KodanevDate: Thu, 18 Jan 2018 20:51:12 +0300 > Commit b05229f44228 ("gre6: Cleanup GREv6 transmit path, > call common GRE functions") moved dev->mtu initialization > from ip6gre_tunnel_setup() to ip6gre_tunnel_init(), as a > result, the previously set values, before ndo_init(), are > reset in the following cases: > > * rtnl_create_link() can update dev->mtu from IFLA_MTU > parameter. > > * ip6gre_tnl_link_config() is invoked before ndo_init() in > netlink and ioctl setup, so ndo_init() can reset MTU > adjustments with the lower device MTU as well, dev->mtu > and dev->hard_header_len. > > Not applicable for ip6gretap because it has one more call > to ip6gre_tnl_link_config(tunnel, 1) in ip6gre_tap_init(). > > Fix the first case by updating dev->mtu with 'tb[IFLA_MTU]' > parameter if a user sets it manually on a device creation, > and fix the second one by moving ip6gre_tnl_link_config() > call after register_netdevice(). > > Fixes: b05229f44228 ("gre6: Cleanup GREv6 transmit path, call common GRE > functions") > Fixes: db2ec95d1ba4 ("ip6_gre: Fix MTU setting") > Signed-off-by: Alexey Kodanev > --- > > v2: Instead of checking whether dev->mtu equals zero or not > in ip6gre_tunnel_init_common(), update 'dev->mtu' once > more with 'IFLA_MTU' parameter after register_netdevice(). Applied and queued up for -stable, thanks Alexey. I was almost tricked that this approach would have problems because you do the MTU setting after register_netdevice(), but we hold the RTNL semaphore so even if the device becomes globally visible the user can't perform a change of the device's MTU before we do it. Thanks.
Re: [patch net] mlxsw: spectrum_router: Free LPM tree upon failure
From: Jiri PirkoDate: Thu, 18 Jan 2018 15:42:10 +0100 > From: Ido Schimmel > > When a new LPM tree is created, we try to replace the trees in the > existing virtual routers with it. If we fail, the tree needs to be > freed. > > Currently, this does not happen in the unlikely case where we fail to > bind the tree to the first virtual router, since its reference count > never transitions from 1 to 0. > > Fix that by taking a reference before binding the tree. > > Fixes: fc922bb0dd94 ("mlxsw: spectrum_router: Use one LPM tree for all > virtual routers") > Signed-off-by: Ido Schimmel > Signed-off-by: Jiri Pirko Applied, thanks Jiri.
Re: [PATCH] bridge: return boolean instead of integer in br_multicast_is_router
Quoting Stephen Hemminger: On Thu, 18 Jan 2018 17:37:45 -0600 "Gustavo A. R. Silva" wrote: Return statements in functions returning bool should use true/false instead of 1/0. This issue was detected with the help of Coccinelle. Signed-off-by: Gustavo A. R. Silva LGTM Fixes: 85b352693264 ("bridge: Fix build error when IGMP_SNOOPING is not enabled") Reviewed-by: Stephen Hemminger Thanks Stephen. -- Gustavo
[PATCH net v2] ibmvnic: Fix IPv6 packet descriptors
Packet descriptor generation for IPv6 is broken. Properly set L3 and L4 protocol flags for IPv6 descriptors. Signed-off-by: Thomas Falcon--- Changes in v2: -keep declarations in reverse xmas tree format -Fix whitespace errors drivers/net/ethernet/ibm/ibmvnic.c | 20 +++- 1 file changed, 11 insertions(+), 9 deletions(-) diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c index 736df59..5397c47 100644 --- a/drivers/net/ethernet/ibm/ibmvnic.c +++ b/drivers/net/ethernet/ibm/ibmvnic.c @@ -1280,6 +1280,7 @@ static int ibmvnic_xmit(struct sk_buff *skb, struct net_device *netdev) unsigned char *dst; u64 *handle_array; int index = 0; + u8 proto = 0; int ret = 0; if (adapter->resetting) { @@ -1368,17 +1369,18 @@ static int ibmvnic_xmit(struct sk_buff *skb, struct net_device *netdev) } if (skb->protocol == htons(ETH_P_IP)) { - if (ip_hdr(skb)->version == 4) - tx_crq.v1.flags1 |= IBMVNIC_TX_PROT_IPV4; - else if (ip_hdr(skb)->version == 6) - tx_crq.v1.flags1 |= IBMVNIC_TX_PROT_IPV6; - - if (ip_hdr(skb)->protocol == IPPROTO_TCP) - tx_crq.v1.flags1 |= IBMVNIC_TX_PROT_TCP; - else if (ip_hdr(skb)->protocol != IPPROTO_TCP) - tx_crq.v1.flags1 |= IBMVNIC_TX_PROT_UDP; + tx_crq.v1.flags1 |= IBMVNIC_TX_PROT_IPV4; + proto = ip_hdr(skb)->protocol; + } else if (skb->protocol == htons(ETH_P_IPV6)) { + tx_crq.v1.flags1 |= IBMVNIC_TX_PROT_IPV6; + proto = ipv6_hdr(skb)->nexthdr; } + if (proto == IPPROTO_TCP) + tx_crq.v1.flags1 |= IBMVNIC_TX_PROT_TCP; + else if (proto == IPPROTO_UDP) + tx_crq.v1.flags1 |= IBMVNIC_TX_PROT_UDP; + if (skb->ip_summed == CHECKSUM_PARTIAL) { tx_crq.v1.flags1 |= IBMVNIC_TX_CHKSUM_OFFLOAD; hdrs += 2; -- 1.8.3.1
Re: [PATCH 0/3] Check gso_size of packets when forwarding
Pravin Shelarwrites: > On Thu, Jan 18, 2018 at 5:08 AM, Daniel Axtens wrote: >> Pravin Shelar writes: >> >>> On Mon, Jan 15, 2018 at 6:09 PM, Daniel Axtens wrote: When regular packets are forwarded, we validate their size against the MTU of the destination device. However, when GSO packets are forwarded, we do not validate their size against the MTU. We implicitly assume that when they are segmented, the resultant packets will be correctly sized. This is not always the case. We observed a case where a packet received on an ibmveth device had a GSO size of around 10kB. This was forwarded by Open vSwitch to a bnx2x device, where it caused a firmware assert. This is described in detail at [0] and was the genesis of this series. Rather than fixing it in the driver, this series fixes the forwarding path. >>> Are there any other possible forwarding path in networking stack? TC >>> is one subsystem that could forward such a packet to the bnx2x device, >>> how is that handled ? >> >> So far I have only looked at bridges, openvswitch and macvlan. In >> general, if the code uses dev_forward_skb() it should automatically be >> fine as that invokes is_skb_forwardable(), which we patch. >> > But there are other ways to forward packets, e.g tc-mirred or bpf > redirect. We need to handle all these cases rather than fixing one at > a time. As Jason suggested netif_needs_gso() looks like good function > to validate if a device is capable of handling given GSO packet. I am not entirely sure this is a better solution. The biggest reason I am uncomfortable with this is that if netif_needs_gso() returns true, the skb will be segmented. The segment sizes will be based on gso_size. Since gso_size is greater than the MTU, the resulting segments will themselves be over-MTU. Those over-MTU segments will then be passed to the network card. I think we should not be creating over-MTU segments; we should instead be dropping the packet and logging. I do take the point that you and Jason are making: a more generic fix would be good. I'm just not sure where to put it. Regards, Daniel
Re: net: r8169: a question of memory barrier in the r8169 driver
On 2018/1/19 9:11, Francois Romieu wrote: Jia-Ju Bai: [...] The function rtl8169_start_xmit reads tp->dirty_tx in TX_FRAGS_READY_FOR: if (unlikely(!TX_FRAGS_READY_FOR(tp, skb_shinfo(skb)->nr_frags))) { netif_err(tp, drv, dev, "BUG! Tx Ring full when queue awake!\n"); goto err_stop_0; } But there is no memory barrier around this code. Is there a possible data race here? This code would not even be needed if rtl8169_start_xmit was only your usual ndo_start_xmit handler: Realtek {ab / re}used it for GSO handling (see r8169_csum_workaround). If the test is not a no-op in this GSO context, it's racy. Thanks for reply. I didn't clearly understand your meaning... I wonder whether there is a possible data race and whether a "smp_mb" is needed before this code? By the way, do you mean that this code can be removed? Thanks, Jia-Ju Bai
Re: net: r8169: a question of memory barrier in the r8169 driver
Jia-Ju Bai: [...] > The function rtl8169_start_xmit reads tp->dirty_tx in TX_FRAGS_READY_FOR: > if (unlikely(!TX_FRAGS_READY_FOR(tp, skb_shinfo(skb)->nr_frags))) { > netif_err(tp, drv, dev, "BUG! Tx Ring full when queue awake!\n"); > goto err_stop_0; > } > But there is no memory barrier around this code. > > Is there a possible data race here? This code would not even be needed if rtl8169_start_xmit was only your usual ndo_start_xmit handler: Realtek {ab / re}used it for GSO handling (see r8169_csum_workaround). If the test is not a no-op in this GSO context, it's racy. -- Ueimor
Re: net: r8169: a question of memory barrier in the r8169 driver
Peter Zijlstra: [...] > There is only 1 variable afaict. Memory barriers need at least 2 in > order to be able to do _anything_. I don't get your point: why don't {cur_tx, dirty_tx} qualify as said two variables ? -- Ueimor
Re: [PATCH net] ibmvnic: Fix IPv6 packet descriptors
On 01/18/2018 07:05 PM, Thomas Falcon wrote: > Packet descriptor generation for IPv6 is broken. > Properly set L3 and L4 protocol flags for IPv6 descriptors. > > Signed-off-by: Thomas FalconSorry! Forgot to check for formatting. Please ignore this patch. > --- > drivers/net/ethernet/ibm/ibmvnic.c | 22 -- > 1 file changed, 12 insertions(+), 10 deletions(-) > > diff --git a/drivers/net/ethernet/ibm/ibmvnic.c > b/drivers/net/ethernet/ibm/ibmvnic.c > index 3700bfa..4bb41cd 100644 > --- a/drivers/net/ethernet/ibm/ibmvnic.c > +++ b/drivers/net/ethernet/ibm/ibmvnic.c > @@ -1281,6 +1281,7 @@ static int ibmvnic_xmit(struct sk_buff *skb, struct > net_device *netdev) > u64 *handle_array; > int index = 0; > int ret = 0; > + u8 proto = 0; > > if (adapter->resetting) { > if (!netif_subqueue_stopped(netdev, skb)) > @@ -1367,18 +1368,19 @@ static int ibmvnic_xmit(struct sk_buff *skb, struct > net_device *netdev) > tx_crq.v1.vlan_id = cpu_to_be16(skb->vlan_tci); > } > > - if (skb->protocol == htons(ETH_P_IP)) { > - if (ip_hdr(skb)->version == 4) > - tx_crq.v1.flags1 |= IBMVNIC_TX_PROT_IPV4; > - else if (ip_hdr(skb)->version == 6) > - tx_crq.v1.flags1 |= IBMVNIC_TX_PROT_IPV6; > - > - if (ip_hdr(skb)->protocol == IPPROTO_TCP) > - tx_crq.v1.flags1 |= IBMVNIC_TX_PROT_TCP; > - else if (ip_hdr(skb)->protocol != IPPROTO_TCP) > - tx_crq.v1.flags1 |= IBMVNIC_TX_PROT_UDP; > +if (skb->protocol == htons(ETH_P_IP)) { > + tx_crq.v1.flags1 |= IBMVNIC_TX_PROT_IPV4; > + proto = ip_hdr(skb)->protocol; > + } else if (skb->protocol == htons(ETH_P_IPV6)) { > + tx_crq.v1.flags1 |= IBMVNIC_TX_PROT_IPV6; > + proto = ipv6_hdr(skb)->nexthdr; > } > > + if (proto == IPPROTO_TCP) > + tx_crq.v1.flags1 |= IBMVNIC_TX_PROT_TCP; > + else if (proto == IPPROTO_UDP) > + tx_crq.v1.flags1 |= IBMVNIC_TX_PROT_UDP; > + > if (skb->ip_summed == CHECKSUM_PARTIAL) { > tx_crq.v1.flags1 |= IBMVNIC_TX_CHKSUM_OFFLOAD; > hdrs += 2;
[PATCH net] ibmvnic: Fix IP offload control buffer
Set some missing fields in the IP control offload buffer. This buffer is used to enable checksum and TCP segmentation offload in the VNIC server. The buffer length field and the checksum offloading bits were not set properly, so fix that here. Signed-off-by: Thomas Falcon--- drivers/net/ethernet/ibm/ibmvnic.c | 4 1 file changed, 4 insertions(+) diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c index b676fa9..3700bfa 100644 --- a/drivers/net/ethernet/ibm/ibmvnic.c +++ b/drivers/net/ethernet/ibm/ibmvnic.c @@ -3351,7 +3351,11 @@ static void handle_query_ip_offload_rsp(struct ibmvnic_adapter *adapter) return; } + adapter->ip_offload_ctrl.len = + cpu_to_be32(sizeof(adapter->ip_offload_ctrl)); adapter->ip_offload_ctrl.version = cpu_to_be32(INITIAL_VERSION_IOB); + adapter->ip_offload_ctrl.ipv4_chksum = buf->ipv4_chksum; + adapter->ip_offload_ctrl.ipv6_chksum = buf->ipv6_chksum; adapter->ip_offload_ctrl.tcp_ipv4_chksum = buf->tcp_ipv4_chksum; adapter->ip_offload_ctrl.udp_ipv4_chksum = buf->udp_ipv4_chksum; adapter->ip_offload_ctrl.tcp_ipv6_chksum = buf->tcp_ipv6_chksum; -- 1.8.3.1
[PATCH net] ibmvnic: Fix IPv6 packet descriptors
Packet descriptor generation for IPv6 is broken. Properly set L3 and L4 protocol flags for IPv6 descriptors. Signed-off-by: Thomas Falcon--- drivers/net/ethernet/ibm/ibmvnic.c | 22 -- 1 file changed, 12 insertions(+), 10 deletions(-) diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c index 3700bfa..4bb41cd 100644 --- a/drivers/net/ethernet/ibm/ibmvnic.c +++ b/drivers/net/ethernet/ibm/ibmvnic.c @@ -1281,6 +1281,7 @@ static int ibmvnic_xmit(struct sk_buff *skb, struct net_device *netdev) u64 *handle_array; int index = 0; int ret = 0; + u8 proto = 0; if (adapter->resetting) { if (!netif_subqueue_stopped(netdev, skb)) @@ -1367,18 +1368,19 @@ static int ibmvnic_xmit(struct sk_buff *skb, struct net_device *netdev) tx_crq.v1.vlan_id = cpu_to_be16(skb->vlan_tci); } - if (skb->protocol == htons(ETH_P_IP)) { - if (ip_hdr(skb)->version == 4) - tx_crq.v1.flags1 |= IBMVNIC_TX_PROT_IPV4; - else if (ip_hdr(skb)->version == 6) - tx_crq.v1.flags1 |= IBMVNIC_TX_PROT_IPV6; - - if (ip_hdr(skb)->protocol == IPPROTO_TCP) - tx_crq.v1.flags1 |= IBMVNIC_TX_PROT_TCP; - else if (ip_hdr(skb)->protocol != IPPROTO_TCP) - tx_crq.v1.flags1 |= IBMVNIC_TX_PROT_UDP; +if (skb->protocol == htons(ETH_P_IP)) { + tx_crq.v1.flags1 |= IBMVNIC_TX_PROT_IPV4; + proto = ip_hdr(skb)->protocol; + } else if (skb->protocol == htons(ETH_P_IPV6)) { + tx_crq.v1.flags1 |= IBMVNIC_TX_PROT_IPV6; + proto = ipv6_hdr(skb)->nexthdr; } + if (proto == IPPROTO_TCP) + tx_crq.v1.flags1 |= IBMVNIC_TX_PROT_TCP; + else if (proto == IPPROTO_UDP) + tx_crq.v1.flags1 |= IBMVNIC_TX_PROT_UDP; + if (skb->ip_summed == CHECKSUM_PARTIAL) { tx_crq.v1.flags1 |= IBMVNIC_TX_CHKSUM_OFFLOAD; hdrs += 2; -- 1.8.3.1
linux-next: manual merge of the net-next tree with the net tree
Hi all, Today's linux-next merge of the net-next tree got a conflict in: kernel/bpf/verifier.c between commit: 6f16101e6a8b ("bpf: mark dst unknown on inconsistent {s, u}bounds adjustments") from the net tree and commit: f4d7e40a5b71 ("bpf: introduce function calls (verification)") from the net-next tree. I fixed it up (I just used the former version) and can carry the fix as necessary. This is now fixed as far as linux-next is concerned, but any non trivial conflicts should be mentioned to your upstream maintainer when your tree is submitted for merging. You may also want to consider cooperating with the maintainer of the conflicting tree to minimise any particularly complex conflicts. -- Cheers, Stephen Rothwell
Re: KASAN: slab-out-of-bounds Read in __dev_queue_xmit
On Thu, 2018-01-18 at 15:58 -0800, syzbot wrote: > Hello, > > syzbot hit the following crash on linux-next commit > 0e08c463db387a2adcb0243b15ab868a73f87807 > > So far this crash happened 6 times on linux-next, mmots, upstream. > C reproducer is attached. > syzkaller reproducer is attached. > Raw console output is attached. > compiler: gcc (GCC) 7.1.1 20170620 > .config is attached. > > IMPORTANT: if you fix the bug, please add the following tag to the commit: > Reported-by: syzbot+9da69ebac7804...@syzkaller.appspotmail.com > It will help syzbot understand when the bug is fixed. See footer for > details. > If you forward the report, please keep this part and the footer. > > device syz0 entered promiscuous mode > audit: type=1400 audit(1514752309.665:10): avc: denied { net_raw } for > pid=3143 comm="syzkaller343753" capability=13 > scontext=unconfined_u:system_r:insmod_t:s0-s0:c0.c1023 > tcontext=unconfined_u:system_r:insmod_t:s0-s0:c0.c1023 tclass=cap_userns > permissive=1 > audit: type=1400 audit(1514752309.668:11): avc: denied { net_admin } for > pid=3143 comm="syzkaller343753" capability=12 > scontext=unconfined_u:system_r:insmod_t:s0-s0:c0.c1023 > tcontext=unconfined_u:system_r:insmod_t:s0-s0:c0.c1023 tclass=cap_userns > permissive=1 > == > BUG: KASAN: slab-out-of-bounds in __tcp_hdrlen include/linux/tcp.h:35 > [inline] > BUG: KASAN: slab-out-of-bounds in tcp_hdrlen include/linux/tcp.h:40 [inline] > BUG: KASAN: slab-out-of-bounds in qdisc_pkt_len_init net/core/dev.c:3160 > [inline] > BUG: KASAN: slab-out-of-bounds in __dev_queue_xmit+0x20d3/0x2200 > net/core/dev.c:3465 > Read of size 2 at addr 8801c85791e0 by task syzkaller343753/3143 > > CPU: 0 PID: 3143 Comm: syzkaller343753 Not tainted > 4.15.0-rc4-next-20171221+ #78 > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS > Google 01/01/2011 > Call Trace: > __dump_stack lib/dump_stack.c:17 [inline] > dump_stack+0x194/0x257 lib/dump_stack.c:53 > print_address_description+0x73/0x250 mm/kasan/report.c:252 > kasan_report_error mm/kasan/report.c:351 [inline] > kasan_report+0x25b/0x340 mm/kasan/report.c:409 > __asan_report_load2_noabort+0x14/0x20 mm/kasan/report.c:428 > __tcp_hdrlen include/linux/tcp.h:35 [inline] > tcp_hdrlen include/linux/tcp.h:40 [inline] > qdisc_pkt_len_init net/core/dev.c:3160 [inline] > __dev_queue_xmit+0x20d3/0x2200 net/core/dev.c:3465 > dev_queue_xmit+0x17/0x20 net/core/dev.c:3554 > packet_snd net/packet/af_packet.c:2943 [inline] > packet_sendmsg+0x3ad5/0x60a0 net/packet/af_packet.c:2968 > sock_sendmsg_nosec net/socket.c:628 [inline] > sock_sendmsg+0xca/0x110 net/socket.c:638 > sock_write_iter+0x31a/0x5d0 net/socket.c:907 > call_write_iter include/linux/fs.h:1776 [inline] > new_sync_write fs/read_write.c:469 [inline] > __vfs_write+0x684/0x970 fs/read_write.c:482 > vfs_write+0x189/0x510 fs/read_write.c:544 > SYSC_write fs/read_write.c:589 [inline] > SyS_write+0xef/0x220 fs/read_write.c:581 > entry_SYSCALL_64_fastpath+0x1f/0x96 > RIP: 0033:0x444df9 > RSP: 002b:007eff78 EFLAGS: 0297 ORIG_RAX: 0001 > RAX: ffda RBX: 7ffc3d2180f0 RCX: 00444df9 > RDX: 00ce RSI: 20fecf2b RDI: 0005 > RBP: R08: 000120080522 R09: 000120080522 > R10: 000120080522 R11: 0297 R12: 004029f0 > R13: 00402a80 R14: R15: > > Allocated by task 3143: > save_stack+0x43/0xd0 mm/kasan/kasan.c:447 > set_track mm/kasan/kasan.c:459 [inline] > kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:551 > __do_kmalloc_node mm/slab.c:3673 [inline] > __kmalloc_node_track_caller+0x47/0x70 mm/slab.c:3687 > __kmalloc_reserve.isra.41+0x41/0xd0 net/core/skbuff.c:137 > __alloc_skb+0x13b/0x780 net/core/skbuff.c:205 > alloc_skb include/linux/skbuff.h:983 [inline] > alloc_skb_with_frags+0x10d/0x750 net/core/skbuff.c:5146 > sock_alloc_send_pskb+0x787/0x9b0 net/core/sock.c:2088 > packet_alloc_skb net/packet/af_packet.c:2802 [inline] > packet_snd net/packet/af_packet.c:2893 [inline] > packet_sendmsg+0x1ec2/0x60a0 net/packet/af_packet.c:2968 > sock_sendmsg_nosec net/socket.c:628 [inline] > sock_sendmsg+0xca/0x110 net/socket.c:638 > sock_write_iter+0x31a/0x5d0 net/socket.c:907 > call_write_iter include/linux/fs.h:1776 [inline] > new_sync_write fs/read_write.c:469 [inline] > __vfs_write+0x684/0x970 fs/read_write.c:482 > vfs_write+0x189/0x510 fs/read_write.c:544 > SYSC_write fs/read_write.c:589 [inline] > SyS_write+0xef/0x220 fs/read_write.c:581 > entry_SYSCALL_64_fastpath+0x1f/0x96 > > Freed by task 0: > (stack is not available) > > The buggy address belongs to the object at 8801c8578d80 > which belongs to the cache kmalloc-1024 of size 1024 > The buggy address is located 96 bytes to the right of > 1024-byte
Re: [PATCH net] net: validate untrusted gso packets
>>> This implements subset of function for codes which was removed by the >>> commit >>> I mentioned below. >> >> No, as I explain above, it performs a different check. >> > > > [...] Clearly I was wrong, sorry. Thanks for pointing out that commit and 576a30eb6453 ("[NET]: Added GSO header verification"). > For performance reason. I think we should delay the check or > segmentation > as > much as possible until it was really needed. Going through segmentation is probably as expensive as flow dissector, if not more so because of the indirect branches. >>> >>> I think we don't even need to care about this consider the evil packet >>> should be rare. >> >> How does frequency matter when a single packet can crash a host? > > > I mean consider we had fix the crash, we don't care how expensive do we spot > this. > >> >>> And what you propose here is just a very small subset of the >>> necessary checking, more comes at gso header checking. So even if we care >>> performance, it only help for some specific case. >> >> It also fixed the bug that Eric sent a separate patch for, as that did >> not dissect as a valid TCP packet, either. > > > I may miss something but how did this patch protects an evil thoff? Actually, it blocked that specific reproducer because the ip protocol did not match. I think that __skb_flow_dissect_tcp should return a boolean, causing dissection return FLOW_DISSECT_RET_OUT_BAD if the tcph is bad. That would be needed to really catch it with flow dissection at the source.
[PATCH] selftests: bpf: update .gitignore with missing generated files
Update .gitignore with missing generated files. Signed-off-by: Shuah Khan--- tools/testing/selftests/bpf/.gitignore | 7 +++ 1 file changed, 7 insertions(+) diff --git a/tools/testing/selftests/bpf/.gitignore b/tools/testing/selftests/bpf/.gitignore index 541d9d7fad5a..1e09d77f1948 100644 --- a/tools/testing/selftests/bpf/.gitignore +++ b/tools/testing/selftests/bpf/.gitignore @@ -3,3 +3,10 @@ test_maps test_lru_map test_lpm_map test_tag +FEATURE-DUMP.libbpf +fixdep +test_align +test_dev_cgroup +test_progs +test_verifier_log +feature -- 2.14.1
Re: [PATCH iproute2 v2 0/9] ip/tunnel: Improve tunnel parameters printing
On Thu, 18 Jan 2018 16:04:27 +0200 Serhey Popovychwrote: > Continue improvements to tunnel modules. Final goal > is to make merge IP and IPv6 variants and make that > transition as transparent as possible. > > Everything within this series is open for your comments, > suggestions and criticism. > > See individual patch description message for details. > > v2 > 1) Fix checkpatch issues: no assignment in condtional > anymore. > > 2) Better code abstraction: introduce tnl_print_encap() > helper and get rid of duplicated code for printing > tunnel encapsulation options. > > Patch with subject > "ip/tunnel: Abstract tunnel encapsulation options printing" > replaces two previous in v1 series > "ip/tunnel: Use print_string() and simplify encap option printing" > and > "ip/tunnel: Minor cleanups in print routines" > > 3) Include patch with subject > "tunnel: Return constant string without copying it" > in the series: it is related to tunneling code too. > > Thanks, > Serhii > > Serhey Popovych (9): > iplink: Use ll_index_to_name() instead of if_indextoname() > ip/tunnel: Correct and unify ttl/hoplimit printing > ip/tunnel: Simplify and unify tos printing > ip/tunnel: Use print_0xhex() instead of print_string() > ip/tunnel: Abstract tunnel encapsulation options printing > gre/tunnel: Print erspan_index using print_uint() > vti/tunnel: Unify ikey/okey printing > vti6/tunnel: Unify and simplify link type help functions > tunnel: Return constant string without copying it > > bridge/fdb.c |5 +- > bridge/link.c | 19 +++ > ip/ip6tunnel.c|5 +- > ip/iplink_bond.c | 38 ++ > ip/iplink_geneve.c| 38 ++ > ip/iplink_vxlan.c | 49 -- > ip/iproute_lwtunnel.c | 11 ++--- > ip/iptunnel.c |2 +- > ip/link_gre.c | 132 > +++-- > ip/link_gre6.c| 103 +- > ip/link_ip6tnl.c | 105 ++- > ip/link_iptnl.c | 128 +++ > ip/link_vti.c | 42 > ip/link_vti6.c| 64 +++- > ip/tunnel.c | 116 --- > ip/tunnel.h |5 ++ > 16 files changed, 340 insertions(+), 522 deletions(-) > This looks fine. Thanks for following up. Applied.
Re: [PATCH iproute2] devlink: Ignore unknown attributes
On Wed, 17 Jan 2018 15:28:00 +0200 Arkadi Sharshevskywrote: > In case of extending the UAPI old packages would break. > > Signed-off-by: Arkadi Sharshevsky Looks like good future proofing. Applied.
[patch 1/1] net/sched/sch_prio.c: work around gcc-4.4.4 union initializer issues
From: Andrew MortonSubject: net/sched/sch_prio.c: work around gcc-4.4.4 union initializer issues gcc-4.4.4 has problems witn anon union initializers. Work around this. net/sched/sch_prio.c: In function 'prio_dump_offload': net/sched/sch_prio.c:260: error: unknown field 'stats' specified in initializer net/sched/sch_prio.c:260: warning: initialization makes integer from pointer without a cast net/sched/sch_prio.c:261: error: unknown field 'stats' specified in initializer net/sched/sch_prio.c:261: warning: initialization makes integer from pointer without a cast Fixes: 7fdb61b44c0c95 ("net: sch: prio: Add offload ability to PRIO qdisc") Cc: Nogah Frankel Cc: Yuval Mintz Cc: Jiri Pirko Cc: David S. Miller Signed-off-by: Andrew Morton --- net/sched/sch_prio.c | 10 +++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff -puN net/sched/sch_prio.c~net-sched-sch_prioc-work-around-gcc-444-union-initializer-issues net/sched/sch_prio.c --- a/net/sched/sch_prio.c~net-sched-sch_prioc-work-around-gcc-444-union-initializer-issues +++ a/net/sched/sch_prio.c @@ -254,11 +254,15 @@ static int prio_dump_offload(struct Qdis { struct net_device *dev = qdisc_dev(sch); struct tc_prio_qopt_offload hw_stats = { + .command = TC_PRIO_STATS, .handle = sch->handle, .parent = sch->parent, - .command = TC_PRIO_STATS, - .stats.bstats = >bstats, - .stats.qstats = >qstats, + { + .stats = { + .bstats = >bstats, + .qstats = >qstats, + }, + }, }; int err; _
Re: [PATCH iproute2] iplink: Fix "alias" parameter length calculations
On Thu, 18 Jan 2018 16:24:40 +0200 Serhey Popovychwrote: > We need NEXT_ARG() to get *argv pointing to "alias" > parameter value. Overwise we get and check "alias" > string length. > > Fixes: f88becf35e08 ("iplink: Process "alias" parameter correctly") > Signed-off-by: Serhey Popovych Good catch. Applied thanks.
Re: [PATCH net] gso: validate gso_type if SKB_GSO_DODGY
> Okay, I'm working on a patch that adds explicit checks Sent: http://patchwork.ozlabs.org/patch/863237/ > > @@ -45,6 +45,9 @@ static struct sk_buff *sctp_gso_segment(struct sk_buff > *skb, > struct sk_buff *segs = ERR_PTR(-EINVAL); > struct sctphdr *sh; > > + if (!skb_shinfo(skb)->gso_type & SKB_GSO_SCTP) > + goto out; > + > > to all transport layer gso offloads: {sctp, tcpv[46], ufov[46], espv[46]}. > This > will block packets with gso_type X but protocol type Y from being parsed. > But does allow entering a tunnel protocol handler if that is different from Y, > unlike the above patch. > > tunnels segmentation itself is protected by skb->encapsulation. Only tunnel > devices in the stack can set this field, not virtio_net_hdr_to_skb. Packets > that > enter {gre, udp tunnel, ipxip4, ipxip6} without this bit are already dropped, > so > no new checks are needed to these based on gso_type. This is not yet sufficient. If a packet comes from userspace with tunnel headers and passes through a tunnel that sets skb->encapsulation, it is indistinguishable from a valid tunneled packet. That bit is not exclusive,
Re: [PATCH] bridge: return boolean instead of integer in br_multicast_is_router
On Thu, 18 Jan 2018 17:37:45 -0600 "Gustavo A. R. Silva"wrote: > Return statements in functions returning bool should use > true/false instead of 1/0. > > This issue was detected with the help of Coccinelle. > > Signed-off-by: Gustavo A. R. Silva LGTM Fixes: 85b352693264 ("bridge: Fix build error when IGMP_SNOOPING is not enabled") Reviewed-by: Stephen Hemminger
[PATCH net v2] gso: validate gso_type if SKB_GSO_DODGY
From: Willem de BruijnValidate gso_type during segmentation as SKB_GSO_DODGY sources may pass packets where the gso_type does not match the contents. Syzkaller was able to enter the SCTP gso handler with a packet of gso_type SKB_GSO_TCPV4. On entry of transport layer gso handlers, verify that the gso_type matches the transport protocol. Fixes: f43798c27684 ("tun: Allow GSO using virtio_net_hdr") Link: http://lkml.kernel.org/r/<001a1137452496ffc305617e5...@google.com> Reported-by: syzbot+fee64147a25aecd48...@syzkaller.appspotmail.com Signed-off-by: Willem de Bruijn --- Similar checks existed until removed in commit 5c7cdf339af5 ("gso: Remove arbitrary checks for unsupported GSO"). But those were limited to the TSO path, not software GSO. I believe that this issue goes back further, hence the Fixes at the first user of virtio_net_hdr. --- net/ipv4/esp4_offload.c | 3 +++ net/ipv4/tcp_offload.c | 3 +++ net/ipv4/udp_offload.c | 3 +++ net/ipv6/esp6_offload.c | 3 +++ net/ipv6/tcpv6_offload.c | 3 +++ net/ipv6/udp_offload.c | 3 +++ net/sctp/offload.c | 3 +++ 7 files changed, 21 insertions(+) diff --git a/net/ipv4/esp4_offload.c b/net/ipv4/esp4_offload.c index b1338e576d00..29b333a62ab0 100644 --- a/net/ipv4/esp4_offload.c +++ b/net/ipv4/esp4_offload.c @@ -122,6 +122,9 @@ static struct sk_buff *esp4_gso_segment(struct sk_buff *skb, if (!xo) goto out; + if (!(skb_shinfo(skb)->gso_type & SKB_GSO_ESP)) + goto out; + seq = xo->seq.low; x = skb->sp->xvec[skb->sp->len - 1]; diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c index b6a2aa1dcf56..4d58e2ce0b5b 100644 --- a/net/ipv4/tcp_offload.c +++ b/net/ipv4/tcp_offload.c @@ -32,6 +32,9 @@ static void tcp_gso_tstamp(struct sk_buff *skb, unsigned int ts_seq, static struct sk_buff *tcp4_gso_segment(struct sk_buff *skb, netdev_features_t features) { + if (!(skb_shinfo(skb)->gso_type & SKB_GSO_TCPV4)) + return ERR_PTR(-EINVAL); + if (!pskb_may_pull(skb, sizeof(struct tcphdr))) return ERR_PTR(-EINVAL); diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c index 01801b77bd0d..ea6e6e7df0ee 100644 --- a/net/ipv4/udp_offload.c +++ b/net/ipv4/udp_offload.c @@ -203,6 +203,9 @@ static struct sk_buff *udp4_ufo_fragment(struct sk_buff *skb, goto out; } + if (!(skb_shinfo(skb)->gso_type & SKB_GSO_UDP)) + goto out; + if (!pskb_may_pull(skb, sizeof(struct udphdr))) goto out; diff --git a/net/ipv6/esp6_offload.c b/net/ipv6/esp6_offload.c index dd9627490c7c..f52c314d4c97 100644 --- a/net/ipv6/esp6_offload.c +++ b/net/ipv6/esp6_offload.c @@ -149,6 +149,9 @@ static struct sk_buff *esp6_gso_segment(struct sk_buff *skb, if (!xo) goto out; + if (!(skb_shinfo(skb)->gso_type & SKB_GSO_ESP)) + goto out; + seq = xo->seq.low; x = skb->sp->xvec[skb->sp->len - 1]; diff --git a/net/ipv6/tcpv6_offload.c b/net/ipv6/tcpv6_offload.c index d883c9204c01..278e49cd67d4 100644 --- a/net/ipv6/tcpv6_offload.c +++ b/net/ipv6/tcpv6_offload.c @@ -46,6 +46,9 @@ static struct sk_buff *tcp6_gso_segment(struct sk_buff *skb, { struct tcphdr *th; + if (!(skb_shinfo(skb)->gso_type & SKB_GSO_TCPV6)) + return ERR_PTR(-EINVAL); + if (!pskb_may_pull(skb, sizeof(*th))) return ERR_PTR(-EINVAL); diff --git a/net/ipv6/udp_offload.c b/net/ipv6/udp_offload.c index a0f89ad76f9d..2a04dc9c781b 100644 --- a/net/ipv6/udp_offload.c +++ b/net/ipv6/udp_offload.c @@ -42,6 +42,9 @@ static struct sk_buff *udp6_ufo_fragment(struct sk_buff *skb, const struct ipv6hdr *ipv6h; struct udphdr *uh; + if (!(skb_shinfo(skb)->gso_type & SKB_GSO_UDP)) + goto out; + if (!pskb_may_pull(skb, sizeof(struct udphdr))) goto out; diff --git a/net/sctp/offload.c b/net/sctp/offload.c index 275925b93b29..35bc7106d182 100644 --- a/net/sctp/offload.c +++ b/net/sctp/offload.c @@ -45,6 +45,9 @@ static struct sk_buff *sctp_gso_segment(struct sk_buff *skb, struct sk_buff *segs = ERR_PTR(-EINVAL); struct sctphdr *sh; + if (!(skb_shinfo(skb)->gso_type & SKB_GSO_SCTP)) + goto out; + sh = sctp_hdr(skb); if (!pskb_may_pull(skb, sizeof(*sh))) goto out; -- 2.16.0.rc1.238.g530d649a79-goog
[PATCH] bridge: return boolean instead of integer in br_multicast_is_router
Return statements in functions returning bool should use true/false instead of 1/0. This issue was detected with the help of Coccinelle. Signed-off-by: Gustavo A. R. Silva--- net/bridge/br_private.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h index 80559fd..8e13a64 100644 --- a/net/bridge/br_private.h +++ b/net/bridge/br_private.h @@ -760,7 +760,7 @@ static inline void br_multicast_flood(struct net_bridge_mdb_entry *mdst, static inline bool br_multicast_is_router(struct net_bridge *br) { - return 0; + return false; } static inline bool br_multicast_querier_exists(struct net_bridge *br, -- 2.7.4
[PATCH] netfilter: return booleans instead of integers
Return statements in functions returning bool should use true/false instead of 1/0. These issues were detected with the help of Coccinelle. Signed-off-by: Gustavo A. R. Silva--- net/netfilter/nf_conncount.c | 2 +- net/netfilter/xt_hashlimit.c | 2 +- net/netfilter/xt_ipcomp.c| 2 +- 3 files changed, 3 insertions(+), 3 deletions(-) diff --git a/net/netfilter/nf_conncount.c b/net/netfilter/nf_conncount.c index a955182..6d65389 100644 --- a/net/netfilter/nf_conncount.c +++ b/net/netfilter/nf_conncount.c @@ -71,7 +71,7 @@ static inline bool already_closed(const struct nf_conn *conn) return conn->proto.tcp.state == TCP_CONNTRACK_TIME_WAIT || conn->proto.tcp.state == TCP_CONNTRACK_CLOSE; else - return 0; + return false; } static int key_diff(const u32 *a, const u32 *b, unsigned int klen) diff --git a/net/netfilter/xt_hashlimit.c b/net/netfilter/xt_hashlimit.c index 5da8746..ec51d9a 100644 --- a/net/netfilter/xt_hashlimit.c +++ b/net/netfilter/xt_hashlimit.c @@ -353,7 +353,7 @@ static int htable_create(struct net *net, struct hashlimit_cfg3 *cfg, static bool select_all(const struct xt_hashlimit_htable *ht, const struct dsthash_ent *he) { - return 1; + return true; } static bool select_gc(const struct xt_hashlimit_htable *ht, diff --git a/net/netfilter/xt_ipcomp.c b/net/netfilter/xt_ipcomp.c index 000e703..7ca64a5 100644 --- a/net/netfilter/xt_ipcomp.c +++ b/net/netfilter/xt_ipcomp.c @@ -58,7 +58,7 @@ static bool comp_mt(const struct sk_buff *skb, struct xt_action_param *par) */ pr_debug("Dropping evil IPComp tinygram.\n"); par->hotdrop = true; - return 0; + return false; } return spi_match(compinfo->spis[0], compinfo->spis[1], -- 2.7.4
Re: [PATCH 2/3] net: is_skb_forwardable: validate length of GSO packet segments
On Tue, Jan 16, 2018 at 01:09:19PM +1100, Daniel Axtens wrote: > is_skb_forwardable attempts to detect if a packet is too large to > be sent to the destination device. However, this test does not > consider GSO packets, and it is possible that a GSO packet, when > resegmented, will be larger than the device can transmit. > > Add detection for packets which will be too large by creating a > skb_gso_validate_len() routine which is similar to > skb_gso_validate_mtu(), but which considers L2 headers, and wire Why not add a 3rd parameter to skb_gso_validate_mtu() instead, hlen? After all it's the only difference between both. And maybe rename _mtu if you think the term doesn't fit in the new usecase or add a wrapper around it for each case. As it is here, it gets very confusing on which function does what, IMHO. > it up in is_skb_forwardable(). > > Signed-off-by: Daniel Axtens> --- > include/linux/skbuff.h | 1 + > net/core/dev.c | 7 --- > net/core/skbuff.c | 34 ++ > 3 files changed, 39 insertions(+), 3 deletions(-) > > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h > index b8acafd73272..6a9e4b6f80a7 100644 > --- a/include/linux/skbuff.h > +++ b/include/linux/skbuff.h > @@ -3287,6 +3287,7 @@ int skb_shift(struct sk_buff *tgt, struct sk_buff *skb, > int shiftlen); > void skb_scrub_packet(struct sk_buff *skb, bool xnet); > unsigned int skb_gso_transport_seglen(const struct sk_buff *skb); > bool skb_gso_validate_mtu(const struct sk_buff *skb, unsigned int mtu); > +bool skb_gso_validate_len(const struct sk_buff *skb, unsigned int len); > struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features); > struct sk_buff *skb_vlan_untag(struct sk_buff *skb); > int skb_ensure_writable(struct sk_buff *skb, int write_len); > diff --git a/net/core/dev.c b/net/core/dev.c > index 94435cd09072..5af04be128c1 100644 > --- a/net/core/dev.c > +++ b/net/core/dev.c > @@ -1841,11 +1841,12 @@ bool is_skb_forwardable(const struct net_device *dev, > const struct sk_buff *skb) > if (skb->len <= len) > return true; > > - /* if TSO is enabled, we don't care about the length as the packet > - * could be forwarded without being segmented before > + /* > + * if TSO is enabled, we need to check the size of the > + * segmented packets >*/ > if (skb_is_gso(skb)) > - return true; > + return skb_gso_validate_len(skb, len); > > return false; > } > diff --git a/net/core/skbuff.c b/net/core/skbuff.c > index 01e8285aea73..aefe049e4b0c 100644 > --- a/net/core/skbuff.c > +++ b/net/core/skbuff.c > @@ -4945,6 +4945,40 @@ bool skb_gso_validate_mtu(const struct sk_buff *skb, > unsigned int mtu) > } > EXPORT_SYMBOL_GPL(skb_gso_validate_mtu); > > +/** > + * skb_gso_validate_len - Will a split GSO skb fit in a given length? > + * > + * @skb: GSO skb > + * @len: length to validate against > + * > + * skb_gso_validate_len validates if a given skb will fit a wanted > + * length once split, including L2, L3 and L4 headers. > + * > + * Similar to skb_gso_validate_mtu, but inclusive of L2 headers. > + */ > +bool skb_gso_validate_len(const struct sk_buff *skb, unsigned int len) > +{ > + const struct skb_shared_info *shinfo = skb_shinfo(skb); > + const struct sk_buff *iter; > + unsigned int hlen; > + > + hlen = skb_gso_mac_seglen(skb); > + > + if (shinfo->gso_size != GSO_BY_FRAGS) > + return hlen <= len; > + > + /* Undo this so we can re-use header sizes */ > + hlen -= GSO_BY_FRAGS; > + > + skb_walk_frags(skb, iter) { > + if (hlen + skb_headlen(iter) > len) > + return false; > + } > + > + return true; > +} > +EXPORT_SYMBOL_GPL(skb_gso_validate_len); > + > static struct sk_buff *skb_reorder_vlan_header(struct sk_buff *skb) > { > if (skb_cow(skb, skb_headroom(skb)) < 0) { > -- > 2.14.1 >
[PATCH bpf-next v6 3/4] libbpf: add missing SPDX-License-Identifier
Signed-off-by: Eric LeblondAcked-by: Alexei Starovoitov --- tools/lib/bpf/bpf.c| 2 ++ tools/lib/bpf/bpf.h| 2 ++ tools/lib/bpf/libbpf.c | 2 ++ tools/lib/bpf/libbpf.h | 2 ++ 4 files changed, 8 insertions(+) diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c index 4517dce6849d..991bd72c4153 100644 --- a/tools/lib/bpf/bpf.c +++ b/tools/lib/bpf/bpf.c @@ -1,3 +1,5 @@ +// SPDX-License-Identifier: LGPL-2.1 + /* * common eBPF ELF operations. * diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h index 9f44c196931e..8d18fb73d7fb 100644 --- a/tools/lib/bpf/bpf.h +++ b/tools/lib/bpf/bpf.h @@ -1,3 +1,5 @@ +/* SPDX-License-Identifier: LGPL-2.1 */ + /* * common eBPF ELF operations. * diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c index c60122d3ea85..71ddc481f349 100644 --- a/tools/lib/bpf/libbpf.c +++ b/tools/lib/bpf/libbpf.c @@ -1,3 +1,5 @@ +// SPDX-License-Identifier: LGPL-2.1 + /* * Common eBPF ELF object loading operations. * diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h index e42f96900318..f85906533cdd 100644 --- a/tools/lib/bpf/libbpf.h +++ b/tools/lib/bpf/libbpf.h @@ -1,3 +1,5 @@ +/* SPDX-License-Identifier: LGPL-2.1 */ + /* * Common eBPF ELF object loading operations. * -- 2.15.1
[PATCH bpf-next v6 4/4] samples/bpf: use bpf_set_link_xdp_fd
Use bpf_set_link_xdp_fd instead of set_link_xdp_fd to remove some code duplication and benefit of netlink ext ack errors message. Signed-off-by: Eric Leblond--- samples/bpf/bpf_load.c | 102 samples/bpf/bpf_load.h | 2 +- samples/bpf/xdp1_user.c | 4 +- samples/bpf/xdp_redirect_cpu_user.c | 6 +-- samples/bpf/xdp_redirect_map_user.c | 8 +-- samples/bpf/xdp_redirect_user.c | 8 +-- samples/bpf/xdp_router_ipv4_user.c | 10 ++-- samples/bpf/xdp_tx_iptunnel_user.c | 6 +-- 8 files changed, 22 insertions(+), 124 deletions(-) diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c index 242631aa4ea2..69806d74fa53 100644 --- a/samples/bpf/bpf_load.c +++ b/samples/bpf/bpf_load.c @@ -695,105 +695,3 @@ struct ksym *ksym_search(long key) return [0]; } -int set_link_xdp_fd(int ifindex, int fd, __u32 flags) -{ - struct sockaddr_nl sa; - int sock, seq = 0, len, ret = -1; - char buf[4096]; - struct nlattr *nla, *nla_xdp; - struct { - struct nlmsghdr nh; - struct ifinfomsg ifinfo; - char attrbuf[64]; - } req; - struct nlmsghdr *nh; - struct nlmsgerr *err; - - memset(, 0, sizeof(sa)); - sa.nl_family = AF_NETLINK; - - sock = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE); - if (sock < 0) { - printf("open netlink socket: %s\n", strerror(errno)); - return -1; - } - - if (bind(sock, (struct sockaddr *), sizeof(sa)) < 0) { - printf("bind to netlink: %s\n", strerror(errno)); - goto cleanup; - } - - memset(, 0, sizeof(req)); - req.nh.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg)); - req.nh.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK; - req.nh.nlmsg_type = RTM_SETLINK; - req.nh.nlmsg_pid = 0; - req.nh.nlmsg_seq = ++seq; - req.ifinfo.ifi_family = AF_UNSPEC; - req.ifinfo.ifi_index = ifindex; - - /* started nested attribute for XDP */ - nla = (struct nlattr *)(((char *)) - + NLMSG_ALIGN(req.nh.nlmsg_len)); - nla->nla_type = NLA_F_NESTED | 43/*IFLA_XDP*/; - nla->nla_len = NLA_HDRLEN; - - /* add XDP fd */ - nla_xdp = (struct nlattr *)((char *)nla + nla->nla_len); - nla_xdp->nla_type = 1/*IFLA_XDP_FD*/; - nla_xdp->nla_len = NLA_HDRLEN + sizeof(int); - memcpy((char *)nla_xdp + NLA_HDRLEN, , sizeof(fd)); - nla->nla_len += nla_xdp->nla_len; - - /* if user passed in any flags, add those too */ - if (flags) { - nla_xdp = (struct nlattr *)((char *)nla + nla->nla_len); - nla_xdp->nla_type = 3/*IFLA_XDP_FLAGS*/; - nla_xdp->nla_len = NLA_HDRLEN + sizeof(flags); - memcpy((char *)nla_xdp + NLA_HDRLEN, , sizeof(flags)); - nla->nla_len += nla_xdp->nla_len; - } - - req.nh.nlmsg_len += NLA_ALIGN(nla->nla_len); - - if (send(sock, , req.nh.nlmsg_len, 0) < 0) { - printf("send to netlink: %s\n", strerror(errno)); - goto cleanup; - } - - len = recv(sock, buf, sizeof(buf), 0); - if (len < 0) { - printf("recv from netlink: %s\n", strerror(errno)); - goto cleanup; - } - - for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len); -nh = NLMSG_NEXT(nh, len)) { - if (nh->nlmsg_pid != getpid()) { - printf("Wrong pid %d, expected %d\n", - nh->nlmsg_pid, getpid()); - goto cleanup; - } - if (nh->nlmsg_seq != seq) { - printf("Wrong seq %d, expected %d\n", - nh->nlmsg_seq, seq); - goto cleanup; - } - switch (nh->nlmsg_type) { - case NLMSG_ERROR: - err = (struct nlmsgerr *)NLMSG_DATA(nh); - if (!err->error) - continue; - printf("nlmsg error %s\n", strerror(-err->error)); - goto cleanup; - case NLMSG_DONE: - break; - } - } - - ret = 0; - -cleanup: - close(sock); - return ret; -} diff --git a/samples/bpf/bpf_load.h b/samples/bpf/bpf_load.h index 7d57a4248893..453c200b389b 100644 --- a/samples/bpf/bpf_load.h +++ b/samples/bpf/bpf_load.h @@ -61,5 +61,5 @@ struct ksym { int load_kallsyms(void); struct ksym *ksym_search(long key); -int set_link_xdp_fd(int ifindex, int fd, __u32 flags); +int bpf_set_link_xdp_fd(int ifindex, int fd, __u32 flags); #endif diff --git a/samples/bpf/xdp1_user.c b/samples/bpf/xdp1_user.c index fdaefe91801d..b901ee2b3336 100644 --- a/samples/bpf/xdp1_user.c +++
Re: [PATCH net] ipv6: don't let tb6_root node share routes with other node
On Thu, Jan 18, 2018 at 03:31:29PM -0800, Wei Wang wrote: > On Thu, Jan 18, 2018 at 2:47 PM, Martin KaFai Lauwrote: > > On Thu, Jan 18, 2018 at 10:40:03AM -0800, Wei Wang wrote: > >> From: Wei Wang > >> > >> After commit 4512c43eac7e, if we add a route to the subtree of tb6_root > >> which does not have any route attached to it yet, the current code will > >> let tb6_root and the node in the subtree share the same route. > >> This could cause problem cause tb6_root has RTN_INFO flag marked and the > > You meant the RTN_RTINFO check in fib6_purge_rt()? > > > Yes. Exactly. Looks good to me. Thanks for the fix! Acked-by: Martin KaFai Lau
[PATCH bpf-next 0/4] libbpf: add XDP binding support
Hello, This patchset rebases the libbpf code on latest bpf-next code and addresses remarks by Daniel. Best regards, -- Eric Leblond
[PATCH bpf-next v6 2/4] libbpf: add error reporting in XDP
Parse netlink ext attribute to get the error message returned by the card. Code is partially take from libnl. Signed-off-by: Eric LeblondAcked-by: Alexei Starovoitov --- samples/bpf/Makefile | 2 +- tools/lib/bpf/Build| 2 +- tools/lib/bpf/bpf.c| 8 +++ tools/lib/bpf/nlattr.c | 187 + tools/lib/bpf/nlattr.h | 70 ++ 5 files changed, 267 insertions(+), 2 deletions(-) create mode 100644 tools/lib/bpf/nlattr.c create mode 100644 tools/lib/bpf/nlattr.h diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile index 7f61a3d57fa7..5c4cd3745282 100644 --- a/samples/bpf/Makefile +++ b/samples/bpf/Makefile @@ -45,7 +45,7 @@ hostprogs-y += xdp_rxq_info hostprogs-y += syscall_tp # Libbpf dependencies -LIBBPF := ../../tools/lib/bpf/bpf.o +LIBBPF := ../../tools/lib/bpf/bpf.o ../../tools/lib/bpf/nlattr.o CGROUP_HELPERS := ../../tools/testing/selftests/bpf/cgroup_helpers.o test_lru_dist-objs := test_lru_dist.o $(LIBBPF) diff --git a/tools/lib/bpf/Build b/tools/lib/bpf/Build index d8749756352d..64c679d67109 100644 --- a/tools/lib/bpf/Build +++ b/tools/lib/bpf/Build @@ -1 +1 @@ -libbpf-y := libbpf.o bpf.o +libbpf-y := libbpf.o bpf.o nlattr.o diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c index e6c61035b64c..4517dce6849d 100644 --- a/tools/lib/bpf/bpf.c +++ b/tools/lib/bpf/bpf.c @@ -26,6 +26,7 @@ #include #include "bpf.h" #include "libbpf.h" +#include "nlattr.h" #include #include #include @@ -440,6 +441,7 @@ int bpf_set_link_xdp_fd(int ifindex, int fd, __u32 flags) struct nlmsghdr *nh; struct nlmsgerr *err; socklen_t addrlen; + int one = 1; memset(, 0, sizeof(sa)); sa.nl_family = AF_NETLINK; @@ -449,6 +451,11 @@ int bpf_set_link_xdp_fd(int ifindex, int fd, __u32 flags) return -errno; } + if (setsockopt(sock, SOL_NETLINK, NETLINK_EXT_ACK, + , sizeof(one)) < 0) { + fprintf(stderr, "Netlink error reporting not supported\n"); + } + if (bind(sock, (struct sockaddr *), sizeof(sa)) < 0) { ret = -errno; goto cleanup; @@ -525,6 +532,7 @@ int bpf_set_link_xdp_fd(int ifindex, int fd, __u32 flags) if (!err->error) continue; ret = err->error; + nla_dump_errormsg(nh); goto cleanup; case NLMSG_DONE: break; diff --git a/tools/lib/bpf/nlattr.c b/tools/lib/bpf/nlattr.c new file mode 100644 index ..4719434278b2 --- /dev/null +++ b/tools/lib/bpf/nlattr.c @@ -0,0 +1,187 @@ +// SPDX-License-Identifier: LGPL-2.1 + +/* + * NETLINK Netlink attributes + * + * This library is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation version 2.1 + * of the License. + * + * Copyright (c) 2003-2013 Thomas Graf + */ + +#include +#include "nlattr.h" +#include +#include +#include + +static uint16_t nla_attr_minlen[NLA_TYPE_MAX+1] = { + [NLA_U8]= sizeof(uint8_t), + [NLA_U16] = sizeof(uint16_t), + [NLA_U32] = sizeof(uint32_t), + [NLA_U64] = sizeof(uint64_t), + [NLA_STRING]= 1, + [NLA_FLAG] = 0, +}; + +static int nla_len(const struct nlattr *nla) +{ + return nla->nla_len - NLA_HDRLEN; +} + +static struct nlattr *nla_next(const struct nlattr *nla, int *remaining) +{ + int totlen = NLA_ALIGN(nla->nla_len); + + *remaining -= totlen; + return (struct nlattr *) ((char *) nla + totlen); +} + +static int nla_ok(const struct nlattr *nla, int remaining) +{ + return remaining >= sizeof(*nla) && + nla->nla_len >= sizeof(*nla) && + nla->nla_len <= remaining; +} + +static void *nla_data(const struct nlattr *nla) +{ + return (char *) nla + NLA_HDRLEN; +} + +static int nla_type(const struct nlattr *nla) +{ + return nla->nla_type & NLA_TYPE_MASK; +} + +static int validate_nla(struct nlattr *nla, int maxtype, + struct nla_policy *policy) +{ + struct nla_policy *pt; + unsigned int minlen = 0; + int type = nla_type(nla); + + if (type < 0 || type > maxtype) + return 0; + + pt = [type]; + + if (pt->type > NLA_TYPE_MAX) + return 0; + + if (pt->minlen) + minlen = pt->minlen; + else if (pt->type != NLA_UNSPEC) + minlen = nla_attr_minlen[pt->type]; + + if (nla_len(nla) < minlen) + return -1; + + if (pt->maxlen && nla_len(nla) > pt->maxlen) + return -1; + + if (pt->type == NLA_STRING) { + char *data = nla_data(nla); + if
[PATCH bpf-next v6 1/4] libbpf: add function to setup XDP
Most of the code is taken from set_link_xdp_fd() in bpf_load.c and slightly modified to be library compliant. Signed-off-by: Eric LeblondAcked-by: Alexei Starovoitov --- tools/lib/bpf/bpf.c| 126 + tools/lib/bpf/libbpf.c | 2 + tools/lib/bpf/libbpf.h | 4 ++ 3 files changed, 132 insertions(+) diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c index 5128677e4117..e6c61035b64c 100644 --- a/tools/lib/bpf/bpf.c +++ b/tools/lib/bpf/bpf.c @@ -25,6 +25,16 @@ #include #include #include "bpf.h" +#include "libbpf.h" +#include +#include +#include + +#ifndef IFLA_XDP_MAX +#define IFLA_XDP 43 +#define IFLA_XDP_FD1 +#define IFLA_XDP_FLAGS 3 +#endif /* * When building perf, unistd.h is overridden. __NR_bpf is @@ -46,7 +56,9 @@ # endif #endif +#ifndef min #define min(x, y) ((x) < (y) ? (x) : (y)) +#endif static inline __u64 ptr_to_u64(const void *ptr) { @@ -413,3 +425,117 @@ int bpf_obj_get_info_by_fd(int prog_fd, void *info, __u32 *info_len) return err; } + +int bpf_set_link_xdp_fd(int ifindex, int fd, __u32 flags) +{ + struct sockaddr_nl sa; + int sock, seq = 0, len, ret = -1; + char buf[4096]; + struct nlattr *nla, *nla_xdp; + struct { + struct nlmsghdr nh; + struct ifinfomsg ifinfo; + char attrbuf[64]; + } req; + struct nlmsghdr *nh; + struct nlmsgerr *err; + socklen_t addrlen; + + memset(, 0, sizeof(sa)); + sa.nl_family = AF_NETLINK; + + sock = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE); + if (sock < 0) { + return -errno; + } + + if (bind(sock, (struct sockaddr *), sizeof(sa)) < 0) { + ret = -errno; + goto cleanup; + } + + addrlen = sizeof(sa); + if (getsockname(sock, (struct sockaddr *), ) < 0) { + ret = -errno; + goto cleanup; + } + + if (addrlen != sizeof(sa)) { + ret = -LIBBPF_ERRNO__INTERNAL; + goto cleanup; + } + + memset(, 0, sizeof(req)); + req.nh.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg)); + req.nh.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK; + req.nh.nlmsg_type = RTM_SETLINK; + req.nh.nlmsg_pid = 0; + req.nh.nlmsg_seq = ++seq; + req.ifinfo.ifi_family = AF_UNSPEC; + req.ifinfo.ifi_index = ifindex; + + /* started nested attribute for XDP */ + nla = (struct nlattr *)(((char *)) + + NLMSG_ALIGN(req.nh.nlmsg_len)); + nla->nla_type = NLA_F_NESTED | IFLA_XDP; + nla->nla_len = NLA_HDRLEN; + + /* add XDP fd */ + nla_xdp = (struct nlattr *)((char *)nla + nla->nla_len); + nla_xdp->nla_type = IFLA_XDP_FD; + nla_xdp->nla_len = NLA_HDRLEN + sizeof(int); + memcpy((char *)nla_xdp + NLA_HDRLEN, , sizeof(fd)); + nla->nla_len += nla_xdp->nla_len; + + /* if user passed in any flags, add those too */ + if (flags) { + nla_xdp = (struct nlattr *)((char *)nla + nla->nla_len); + nla_xdp->nla_type = IFLA_XDP_FLAGS; + nla_xdp->nla_len = NLA_HDRLEN + sizeof(flags); + memcpy((char *)nla_xdp + NLA_HDRLEN, , sizeof(flags)); + nla->nla_len += nla_xdp->nla_len; + } + + req.nh.nlmsg_len += NLA_ALIGN(nla->nla_len); + + if (send(sock, , req.nh.nlmsg_len, 0) < 0) { + ret = -errno; + goto cleanup; + } + + len = recv(sock, buf, sizeof(buf), 0); + if (len < 0) { + ret = -errno; + goto cleanup; + } + + for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len); +nh = NLMSG_NEXT(nh, len)) { + if (nh->nlmsg_pid != sa.nl_pid) { + ret = -LIBBPF_ERRNO__WRNGPID; + goto cleanup; + } + if (nh->nlmsg_seq != seq) { + ret = -LIBBPF_ERRNO__INVSEQ; + goto cleanup; + } + switch (nh->nlmsg_type) { + case NLMSG_ERROR: + err = (struct nlmsgerr *)NLMSG_DATA(nh); + if (!err->error) + continue; + ret = err->error; + goto cleanup; + case NLMSG_DONE: + break; + default: + break; + } + } + + ret = 0; + +cleanup: + close(sock); + return ret; +} diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c index 30c776375118..c60122d3ea85 100644 --- a/tools/lib/bpf/libbpf.c +++ b/tools/lib/bpf/libbpf.c @@ -106,6 +106,8 @@ static const char *libbpf_strerror_table[NR_ERRNO] = { [ERRCODE_OFFSET(PROG2BIG)] = "Program too big",
Re: [PATCH bpf-next v5 2/4] libbpf: add error reporting in XDP
Hi, Sorry for the delay, missed the mail. On Sat, 2018-01-06 at 22:16 +0100, Daniel Borkmann wrote: > On 01/04/2018 09:21 AM, Eric Leblond wrote: > > Parse netlink ext attribute to get the error message returned by > > the card. Code is partially take from libnl. > > > > Signed-off-by: Eric Leblond> > Acked-by: Alexei Starovoitov > > --- > > samples/bpf/Makefile | 2 +- > > tools/lib/bpf/Build| 2 +- > > tools/lib/bpf/bpf.c| 10 ++- > > tools/lib/bpf/nlattr.c | 187 > > + > > tools/lib/bpf/nlattr.h | 70 ++ > > 5 files changed, 268 insertions(+), 3 deletions(-) > > create mode 100644 tools/lib/bpf/nlattr.c > > create mode 100644 tools/lib/bpf/nlattr.h > > > > diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile > > index 4fb944a7ecf8..c889ebcba9b3 100644 > > --- a/samples/bpf/Makefile > > +++ b/samples/bpf/Makefile > > @@ -44,7 +44,7 @@ hostprogs-y += xdp_monitor > > hostprogs-y += syscall_tp > > > > # Libbpf dependencies > > -LIBBPF := ../../tools/lib/bpf/bpf.o > > +LIBBPF := ../../tools/lib/bpf/bpf.o ../../tools/lib/bpf/nlattr.o > > CGROUP_HELPERS := > > ../../tools/testing/selftests/bpf/cgroup_helpers.o > > > > test_lru_dist-objs := test_lru_dist.o $(LIBBPF) > > diff --git a/tools/lib/bpf/Build b/tools/lib/bpf/Build > > index d8749756352d..64c679d67109 100644 > > --- a/tools/lib/bpf/Build > > +++ b/tools/lib/bpf/Build > > @@ -1 +1 @@ > > -libbpf-y := libbpf.o bpf.o > > +libbpf-y := libbpf.o bpf.o nlattr.o > > diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c > > index e6c61035b64c..10d71b9fdbd0 100644 > > --- a/tools/lib/bpf/bpf.c > > +++ b/tools/lib/bpf/bpf.c > > @@ -26,6 +26,7 @@ > > #include > > #include "bpf.h" > > #include "libbpf.h" > > +#include "nlattr.h" > > #include > > #include > > #include > > @@ -440,6 +441,7 @@ int bpf_set_link_xdp_fd(int ifindex, int fd, > > __u32 flags) > > struct nlmsghdr *nh; > > struct nlmsgerr *err; > > socklen_t addrlen; > > + int one; > > Hmm, it's not initialized here to 1 ... > > > memset(, 0, sizeof(sa)); > > sa.nl_family = AF_NETLINK; > > @@ -449,6 +451,11 @@ int bpf_set_link_xdp_fd(int ifindex, int fd, > > __u32 flags) > > return -errno; > > } > > > > + if (setsockopt(sock, SOL_NETLINK, NETLINK_EXT_ACK, > > + , sizeof(one)) < 0) { > > ... so we turn it on by chance here. Indeed, fixing that. > > + fprintf(stderr, "Netlink error reporting not > > supported\n"); > > + } > > + > > if (bind(sock, (struct sockaddr *), sizeof(sa)) < 0) { > > ret = -errno; > > goto cleanup; > > @@ -524,7 +531,8 @@ int bpf_set_link_xdp_fd(int ifindex, int fd, > > __u32 flags) > > err = (struct nlmsgerr *)NLMSG_DATA(nh); > > if (!err->error) > > continue; > > - ret = err->error; > > + ret = -err->error; > > This one looks strange. Your prior patch added the 'ret = err->error' > and this one negates it. Which variant is the correct version? From > digging into the kernel code, my take is that 'ret = err->error' was > the correct variant since it already holds the negative error code. > Could you double check? Yes all netlink_ack usage I have seen are using the negative value of the error. Fixing that too. BR, -- Eric Leblond Blog: https://home.regit.org/
[PATCH] smc: return boolean instead of integer in using_ipsec
Return statements in functions returning bool should use true/false instead of 1/0. This issue was detected with the help of Coccinelle. Signed-off-by: Gustavo A. R. Silva--- net/smc/smc.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/smc/smc.h b/net/smc/smc.h index 0bee9d1..ef13af4 100644 --- a/net/smc/smc.h +++ b/net/smc/smc.h @@ -258,7 +258,7 @@ static inline bool using_ipsec(struct smc_sock *smc) #else static inline bool using_ipsec(struct smc_sock *smc) { - return 0; + return false; } #endif -- 2.7.4
Re: [PATCH bpf] bpf: fix 32-bit divide by zero
On 01/18/2018 11:40 PM, Alexei Starovoitov wrote: > On 1/18/18 2:30 PM, Eric Dumazet wrote: >> On Fri, 2018-01-12 at 18:59 -0800, Alexei Starovoitov wrote: >>> due to some JITs doing if (src_reg == 0) check in 64-bit mode >>> for div/mod opreations mask upper 32-bits of src register >>> before doing the check >> >> Is the plan to fix JIT, and if they can all be fixed, >> revert this patch ? > > No need to fix JITs. > I'd rather remove 'cmp rX, 0' from JITs and let verifier emit it. > It's more generic and gives us flexibility to decide what to do > with divide by zero and other exceptions. Yeah, working on this, but patch most likely for the next window. The 'return 0' is really suboptimal as exception code for some program types, so I'd like to have struct bpf_verifier_ops to define such exception return code to be used, so that the verifier can apply this via bpf insns directly.
Re: [PATCH net] ipv6: don't let tb6_root node share routes with other node
On Thu, Jan 18, 2018 at 2:47 PM, Martin KaFai Lauwrote: > On Thu, Jan 18, 2018 at 10:40:03AM -0800, Wei Wang wrote: >> From: Wei Wang >> >> After commit 4512c43eac7e, if we add a route to the subtree of tb6_root >> which does not have any route attached to it yet, the current code will >> let tb6_root and the node in the subtree share the same route. >> This could cause problem cause tb6_root has RTN_INFO flag marked and the > You meant the RTN_RTINFO check in fib6_purge_rt()? > Yes. Exactly. >> tree repair and clean up code will not work properly. >> This commit makes sure tb6_root->leaf points back to null_entry instead >> of sharing route with other node. >> >> It fixes the following syzkaller reported issue: >> BUG: KASAN: use-after-free in ipv6_prefix_equal include/net/ipv6.h:540 >> [inline] >> BUG: KASAN: use-after-free in fib6_add_1+0x165f/0x1790 net/ipv6/ip6_fib.c:618 >> Read of size 8 at addr 8801bc043498 by task syz-executor5/19819 >> >> CPU: 1 PID: 19819 Comm: syz-executor5 Not tainted 4.15.0-rc7+ #186 >> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS >> Google 01/01/2011 >> Call Trace: >> __dump_stack lib/dump_stack.c:17 [inline] >> dump_stack+0x194/0x257 lib/dump_stack.c:53 >> print_address_description+0x73/0x250 mm/kasan/report.c:252 >> kasan_report_error mm/kasan/report.c:351 [inline] >> kasan_report+0x25b/0x340 mm/kasan/report.c:409 >> __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:430 >> ipv6_prefix_equal include/net/ipv6.h:540 [inline] >> fib6_add_1+0x165f/0x1790 net/ipv6/ip6_fib.c:618 >> fib6_add+0x5fa/0x1540 net/ipv6/ip6_fib.c:1214 >> __ip6_ins_rt+0x6c/0x90 net/ipv6/route.c:1003 >> ip6_route_add+0x141/0x190 net/ipv6/route.c:2790 >> ipv6_route_ioctl+0x4db/0x6b0 net/ipv6/route.c:3299 >> inet6_ioctl+0xef/0x1e0 net/ipv6/af_inet6.c:520 >> sock_do_ioctl+0x65/0xb0 net/socket.c:958 >> sock_ioctl+0x2c2/0x440 net/socket.c:1055 >> vfs_ioctl fs/ioctl.c:46 [inline] >> do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:686 >> SYSC_ioctl fs/ioctl.c:701 [inline] >> SyS_ioctl+0x8f/0xc0 fs/ioctl.c:692 >> entry_SYSCALL_64_fastpath+0x23/0x9a >> RIP: 0033:0x452ac9 >> RSP: 002b:7fd42b321c58 EFLAGS: 0212 ORIG_RAX: 0010 >> RAX: ffda RBX: 0071bea0 RCX: 00452ac9 >> RDX: 20fd7000 RSI: 890b RDI: 0013 >> RBP: 049e R08: R09: >> R10: R11: 0212 R12: 006f4f70 >> R13: R14: 7fd42b3226d4 R15: >> >> Fixes: 4512c43eac7e ("ipv6: remove null_entry before adding default route") >> Signed-off-by: Wei Wang >> Acked-by: Eric Dumazet >> --- >> net/ipv6/ip6_fib.c | 10 -- >> 1 file changed, 8 insertions(+), 2 deletions(-) >> >> diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c >> index 9dcc3924a975..217683d40f12 100644 >> --- a/net/ipv6/ip6_fib.c >> +++ b/net/ipv6/ip6_fib.c >> @@ -1226,8 +1226,14 @@ int fib6_add(struct fib6_node *root, struct rt6_info >> *rt, >> } >> >> if (!rcu_access_pointer(fn->leaf)) { >> - atomic_inc(>rt6i_ref); >> - rcu_assign_pointer(fn->leaf, rt); >> + if (fn->fn_flags & RTN_TL_ROOT) { >> + /* put back null_entry for root node */ >> + rcu_assign_pointer(fn->leaf, >> + info->nl_net->ipv6.ip6_null_entry); >> + } else { >> + atomic_inc(>rt6i_ref); >> + rcu_assign_pointer(fn->leaf, rt); >> + } >> } >> fn = sn; >> } >> -- >> 2.16.0.rc1.238.g530d649a79-goog >>
Re: [PATCH RFC net-next 1/6] sock: MSG_PEEK support for sk_error_queue
On Thu, Jan 18, 2018 at 6:20 PM, Sowmini Varadhanwrote: > On (01/18/18 18:09), Willem de Bruijn wrote: >> If that is true in general for PF_RDS, then it is a reasonable approach. >> How about treating it as a (follow-on) optimization path. Opportunistic >> piggybacking of notifications on data reads is more widely applicable. > > sounds good. > >> > that's similar to what I have, except that it does not have the >> > MSG_PEEK part (you'd need to enforce that the data portion >> > is upper-bounded, and that the application has the responsibility >> > of sending down "enough" buffer with recvmsg). >> >> Right. I think that an upper bound is the simplest solution here. >> >> By the way, if you allocate an skb immediately on page pinning, then >> there are always sufficient skbs to store all notifications. On errqueue >> enqueue just drop the new skb and copy its notification to the body of >> the skb already on the queue, if one exists and it has room. That is >> essentially what the tcp zerocopy code does with the [data, info] range. > > ok, I'll give that a shot (I'm working through the other review comments > as well) > > fwiw, the data-corruption issue I mentioned turned out to be a day-one > bug in rds-tcp (patched in http://patchwork.ozlabs.org/patch/863183/). > The buffer reaping with zcopy (and aggressiveness of rds-stress) brought > this one out.. Thanks. Good to hear that it's not in zerocopy, itself.
Re: [PATCH RFC net-next 1/6] sock: MSG_PEEK support for sk_error_queue
On (01/18/18 18:09), Willem de Bruijn wrote: > If that is true in general for PF_RDS, then it is a reasonable approach. > How about treating it as a (follow-on) optimization path. Opportunistic > piggybacking of notifications on data reads is more widely applicable. sounds good. > > that's similar to what I have, except that it does not have the > > MSG_PEEK part (you'd need to enforce that the data portion > > is upper-bounded, and that the application has the responsibility > > of sending down "enough" buffer with recvmsg). > > Right. I think that an upper bound is the simplest solution here. > > By the way, if you allocate an skb immediately on page pinning, then > there are always sufficient skbs to store all notifications. On errqueue > enqueue just drop the new skb and copy its notification to the body of > the skb already on the queue, if one exists and it has room. That is > essentially what the tcp zerocopy code does with the [data, info] range. ok, I'll give that a shot (I'm working through the other review comments as well) fwiw, the data-corruption issue I mentioned turned out to be a day-one bug in rds-tcp (patched in http://patchwork.ozlabs.org/patch/863183/). The buffer reaping with zcopy (and aggressiveness of rds-stress) brought this one out.. --Sowmini
[PATCH net-next v2] net: stmmac: Fix reception of Broadcom switches tags
Broadcom tags inserted by Broadcom switches put a 4 byte header after the MAC SA and before the EtherType, which may look like some sort of 0 length LLC/SNAP packet (tcpdump and wireshark do think that way). With ACS enabled in stmmac the packets were truncated to 8 bytes on reception, whereas clearing this bit allowed normal reception to occur. In order to make that possible, we need to pass a net_device argument to the different core_init() functions and we are dependent on the Broadcom tagger padding packets correctly (which it now does). To be as little invasive as possible, this is only done for gmac1000 when the network device is DSA-enabled (netdev_uses_dsa() returns true). Signed-off-by: Florian Fainelli--- Changes in v2: - fixed build failure in dwmac4_core.c - updated dwmac100_core.c to also clear the ASTP bit drivers/net/ethernet/stmicro/stmmac/common.h | 2 +- drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c| 3 ++- drivers/net/ethernet/stmicro/stmmac/dwmac1000_core.c | 12 +++- drivers/net/ethernet/stmicro/stmmac/dwmac100_core.c | 15 +-- drivers/net/ethernet/stmicro/stmmac/dwmac4_core.c| 12 +++- drivers/net/ethernet/stmicro/stmmac/stmmac_main.c| 2 +- 6 files changed, 39 insertions(+), 7 deletions(-) diff --git a/drivers/net/ethernet/stmicro/stmmac/common.h b/drivers/net/ethernet/stmicro/stmmac/common.h index ce2ea2d491ac..2ffe76c0ff74 100644 --- a/drivers/net/ethernet/stmicro/stmmac/common.h +++ b/drivers/net/ethernet/stmicro/stmmac/common.h @@ -474,7 +474,7 @@ struct mac_device_info; /* Helpers to program the MAC core */ struct stmmac_ops { /* MAC core initialization */ - void (*core_init)(struct mac_device_info *hw, int mtu); + void (*core_init)(struct mac_device_info *hw, struct net_device *dev); /* Enable the MAC RX/TX */ void (*set_mac)(void __iomem *ioaddr, bool enable); /* Enable and verify that the IPC module is supported */ diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c b/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c index 9eb7f65d8000..a3fa65b1ca8e 100644 --- a/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c @@ -483,7 +483,8 @@ static int sun8i_dwmac_init(struct platform_device *pdev, void *priv) return 0; } -static void sun8i_dwmac_core_init(struct mac_device_info *hw, int mtu) +static void sun8i_dwmac_core_init(struct mac_device_info *hw, + struct net_device *dev) { void __iomem *ioaddr = hw->pcsr; u32 v; diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac1000_core.c b/drivers/net/ethernet/stmicro/stmmac/dwmac1000_core.c index 8a86340ff2d3..540d21786a43 100644 --- a/drivers/net/ethernet/stmicro/stmmac/dwmac1000_core.c +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac1000_core.c @@ -25,18 +25,28 @@ #include #include #include +#include #include #include "stmmac_pcs.h" #include "dwmac1000.h" -static void dwmac1000_core_init(struct mac_device_info *hw, int mtu) +static void dwmac1000_core_init(struct mac_device_info *hw, + struct net_device *dev) { void __iomem *ioaddr = hw->pcsr; u32 value = readl(ioaddr + GMAC_CONTROL); + int mtu = dev->mtu; /* Configure GMAC core */ value |= GMAC_CORE_INIT; + /* Clear ACS bit because Ethernet switch tagging formats such as +* Broadcom tags can look like invalid LLC/SNAP packets and cause the +* hardware to truncate packets on reception. +*/ + if (netdev_uses_dsa(dev)) + value &= ~GMAC_CONTROL_ACS; + if (mtu > 1500) value |= GMAC_CONTROL_2K; if (mtu > 2000) diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac100_core.c b/drivers/net/ethernet/stmicro/stmmac/dwmac100_core.c index 8ef517356313..91b23f9db31a 100644 --- a/drivers/net/ethernet/stmicro/stmmac/dwmac100_core.c +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac100_core.c @@ -25,15 +25,26 @@ ***/ #include +#include #include #include "dwmac100.h" -static void dwmac100_core_init(struct mac_device_info *hw, int mtu) +static void dwmac100_core_init(struct mac_device_info *hw, + struct net_device *dev) { void __iomem *ioaddr = hw->pcsr; u32 value = readl(ioaddr + MAC_CONTROL); - writel((value | MAC_CORE_INIT), ioaddr + MAC_CONTROL); + value |= MAC_CORE_INIT; + + /* Clear ASTP bit because Ethernet switch tagging formats such as +* Broadcom tags can look like invalid LLC/SNAP packets and cause the +* hardware to truncate packets on reception. +*/ + if (netdev_uses_dsa(dev)) + value &= ~MAC_CONTROL_ASTP; + + writel(value, ioaddr + MAC_CONTROL); #ifdef
[PATCH bpf-next 0/2] bpf: implement MAP_GET_NEXT_KEY command for LPM_TRIE map
This patch set implements MAP_GET_NEXT_KEY command for LPM_TRIE map. This command is really useful for key enumeration, and for key deletion if what keys in the trie are unknown. Patch #1 implements the functionality in the kernel and patch #2 adds a test case in tools/testing/selftests/bpf. Yonghong Song (2): bpf: implement MAP_GET_NEXT_KEY command for LPM_TRIE map tools/bpf: add a testcase for MAP_GET_NEXT_KEY command of LPM_TRIE map kernel/bpf/lpm_trie.c | 95 +- tools/testing/selftests/bpf/test_lpm_map.c | 122 + 2 files changed, 215 insertions(+), 2 deletions(-) -- 2.9.5
[PATCH bpf-next 1/2] bpf: implement MAP_GET_NEXT_KEY command for LPM_TRIE map
Current LPM_TRIE map type does not implement MAP_GET_NEXT_KEY command. This command is handy when users want to enumerate keys. Otherwise, a different map which supports key enumeration may be required to store the keys. If the map data is sparse and all map data are to be deleted without closing file descriptor, using MAP_GET_NEXT_KEY to find all keys is much faster than enumerating all key space. This patch implements MAP_GET_NEXT_KEY command for LPM_TRIE map. If user provided key pointer is NULL or the key does not have an exact match in the trie, the first key will be returned. Otherwise, the next key will be returned. In this implemenation, key enumeration follows a postorder traversal of internal trie. More specific keys will be returned first than less specific ones, given a sequence of MAP_GET_NEXT_KEY syscalls. Signed-off-by: Yonghong Song--- kernel/bpf/lpm_trie.c | 95 +-- 1 file changed, 93 insertions(+), 2 deletions(-) diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c index 584e022..d7ea962 100644 --- a/kernel/bpf/lpm_trie.c +++ b/kernel/bpf/lpm_trie.c @@ -591,9 +591,100 @@ static void trie_free(struct bpf_map *map) raw_spin_unlock(>lock); } -static int trie_get_next_key(struct bpf_map *map, void *key, void *next_key) +static int trie_get_next_key(struct bpf_map *map, void *_key, void *_next_key) { - return -ENOTSUPP; + struct lpm_trie *trie = container_of(map, struct lpm_trie, map); + struct bpf_lpm_trie_key *key = _key, *next_key = _next_key; + struct lpm_trie_node *node, *next_node = NULL, *parent; + struct lpm_trie_node **node_stack = NULL; + struct lpm_trie_node __rcu **root; + int err = 0, stack_ptr = -1; + unsigned int next_bit; + size_t matchlen; + + /* The get_next_key follows postorder. For the 4 node example in +* the top of this file, the trie_get_next_key() returns the following +* one after another: +* 192.168.0.0/24 +* 192.168.1.0/24 +* 192.168.128.0/24 +* 192.168.0.0/16 +* +* The idea is to return more specific keys before less specific ones. +*/ + + /* Empty trie */ + if (!rcu_dereference(trie->root)) + return -ENOENT; + + /* For invalid key, find the leftmost node in the trie */ + if (!key || key->prefixlen > trie->max_prefixlen) { + root = >root; + goto find_leftmost; + } + + node_stack = kmalloc(trie->max_prefixlen * sizeof(struct lpm_trie_node *), +GFP_USER | __GFP_NOWARN); + if (!node_stack) + return -ENOMEM; + + /* Try to find the exact node for the given key */ + for (node = rcu_dereference(trie->root); node;) { + node_stack[++stack_ptr] = node; + matchlen = longest_prefix_match(trie, node, key); + if (node->prefixlen != matchlen || + node->prefixlen == key->prefixlen) + break; + + next_bit = extract_bit(key->data, node->prefixlen); + node = rcu_dereference(node->child[next_bit]); + } + if (!node || node->prefixlen != key->prefixlen || + (node->flags & LPM_TREE_NODE_FLAG_IM)) { + root = >root; + goto find_leftmost; + } + + /* The node with the exactly-matching key has been found, +* find the first node in postorder after the matched node. +*/ + node = node_stack[stack_ptr]; + while (stack_ptr > 0) { + parent = node_stack[stack_ptr - 1]; + if (rcu_dereference(parent->child[0]) == node && + rcu_dereference(parent->child[1])) { + root = >child[1]; + goto find_leftmost; + } + if (!(parent->flags & LPM_TREE_NODE_FLAG_IM)) { + next_node = parent; + goto do_copy; + } + + node = parent; + stack_ptr--; + } + + /* did not find anything */ + err = -ENOENT; + goto free_stack; + +find_leftmost: + /* Find the leftmost non-intermediate node, all intermediate nodes +* have exact two children, so this function will never return NULL. +*/ + for (node = rcu_dereference(*root); node;) { + if (!(node->flags & LPM_TREE_NODE_FLAG_IM)) + next_node = node; + node = rcu_dereference(node->child[0]); + } +do_copy: + next_key->prefixlen = next_node->prefixlen; + memcpy((void *)next_key + offsetof(struct bpf_lpm_trie_key, data), + next_node->data, trie->data_size); +free_stack: + kfree(node_stack); + return err; } const struct bpf_map_ops trie_map_ops = { -- 2.9.5
Re: [PATCH RFC net-next 1/6] sock: MSG_PEEK support for sk_error_queue
On Thu, Jan 18, 2018 at 6:03 PM, Sowmini Varadhanwrote: > On (01/18/18 17:54), Willem de Bruijn wrote: >> > 2. If we have the option of passing completion-notification up as ancillary >> >data on the pollin/recvmsg channel itself (instead of MSG_ERRQUEUE) >> >> This assumes a somewhat symmetric workload, where there are enough recv >> calls to reap the notification associated with the send calls. > > Your comment about the assumption is true, but at least for the database > use-cases, we have a request-response model, so the assumption works out.. > I dont know if many other workloads that send large buffers have this > pattern. If that is true in general for PF_RDS, then it is a reasonable approach. How about treating it as a (follow-on) optimization path. Opportunistic piggybacking of notifications on data reads is more widely applicable. > >> I would stay with MSG_ERRQUEUE processing. One option is to pass data >> up to userspace in the data portion of the notification skb instead of >> encoding it in ancillary data, like tcp_get_timestamping_opt_stats. > > that's similar to what I have, except that it does not have the > MSG_PEEK part (you'd need to enforce that the data portion > is upper-bounded, and that the application has the responsibility > of sending down "enough" buffer with recvmsg). Right. I think that an upper bound is the simplest solution here. By the way, if you allocate an skb immediately on page pinning, then there are always sufficient skbs to store all notifications. On errqueue enqueue just drop the new skb and copy its notification to the body of the skb already on the queue, if one exists and it has room. That is essentially what the tcp zerocopy code does with the [data, info] range. > Note that any one of these choices are ok with me- I have no > special attachments to any of them.
[PATCH bpf-next 2/2] tools/bpf: add a testcase for MAP_GET_NEXT_KEY command of LPM_TRIE map
A test case is added in tools/testing/selftests/bpf/test_lpm_map.c for MAP_GET_NEXT_KEY command. A four node trie, which is described in kernel/bpf/lpm_trie.c, is built and the MAP_GET_NEXT_KEY results are checked. Signed-off-by: Yonghong Song--- tools/testing/selftests/bpf/test_lpm_map.c | 122 + 1 file changed, 122 insertions(+) diff --git a/tools/testing/selftests/bpf/test_lpm_map.c b/tools/testing/selftests/bpf/test_lpm_map.c index f614806..0815108 100644 --- a/tools/testing/selftests/bpf/test_lpm_map.c +++ b/tools/testing/selftests/bpf/test_lpm_map.c @@ -521,6 +521,126 @@ static void test_lpm_delete(void) close(map_fd); } +static void test_lpm_get_next_key(void) +{ + struct bpf_lpm_trie_key *key_p, *next_key_p; + size_t key_size; + __u32 value = 0; + int map_fd; + + key_size = sizeof(*key_p) + sizeof(__u32); + key_p = alloca(key_size); + next_key_p = alloca(key_size); + + map_fd = bpf_create_map(BPF_MAP_TYPE_LPM_TRIE, key_size, sizeof(value), + 100, BPF_F_NO_PREALLOC); + assert(map_fd >= 0); + + /* empty tree. get_next_key should return ENOENT */ + assert(bpf_map_get_next_key(map_fd, NULL, key_p) == -1 && + errno == ENOENT); + + /* get and verify the first key, get the second one should fail. */ + key_p->prefixlen = 16; + inet_pton(AF_INET, "192.168.0.0", key_p->data); + assert(bpf_map_update_elem(map_fd, key_p, , 0) == 0); + + memset(key_p, 0, key_size); + assert(bpf_map_get_next_key(map_fd, NULL, key_p) == 0); + assert(key_p->prefixlen == 16 && key_p->data[0] == 192 && + key_p->data[1] == 168); + + assert(bpf_map_get_next_key(map_fd, key_p, next_key_p) == -1 && + errno == ENOENT); + + /* no exact matching key should get the first one in post order. */ + key_p->prefixlen = 8; + assert(bpf_map_get_next_key(map_fd, NULL, key_p) == 0); + assert(key_p->prefixlen == 16 && key_p->data[0] == 192 && + key_p->data[1] == 168); + + /* add one more element (total two) */ + key_p->prefixlen = 24; + inet_pton(AF_INET, "192.168.0.0", key_p->data); + assert(bpf_map_update_elem(map_fd, key_p, , 0) == 0); + + memset(key_p, 0, key_size); + assert(bpf_map_get_next_key(map_fd, NULL, key_p) == 0); + assert(key_p->prefixlen == 24 && key_p->data[0] == 192 && + key_p->data[1] == 168 && key_p->data[2] == 0); + + memset(next_key_p, 0, key_size); + assert(bpf_map_get_next_key(map_fd, key_p, next_key_p) == 0); + assert(next_key_p->prefixlen == 16 && next_key_p->data[0] == 192 && + next_key_p->data[1] == 168); + + memcpy(key_p, next_key_p, key_size); + assert(bpf_map_get_next_key(map_fd, key_p, next_key_p) == -1 && + errno == ENOENT); + + /* Add one more element (total three) */ + key_p->prefixlen = 24; + inet_pton(AF_INET, "192.168.128.0", key_p->data); + assert(bpf_map_update_elem(map_fd, key_p, , 0) == 0); + + memset(key_p, 0, key_size); + assert(bpf_map_get_next_key(map_fd, NULL, key_p) == 0); + assert(key_p->prefixlen == 24 && key_p->data[0] == 192 && + key_p->data[1] == 168 && key_p->data[2] == 0); + + memset(next_key_p, 0, key_size); + assert(bpf_map_get_next_key(map_fd, key_p, next_key_p) == 0); + assert(next_key_p->prefixlen == 24 && next_key_p->data[0] == 192 && + next_key_p->data[1] == 168 && next_key_p->data[2] == 128); + + memcpy(key_p, next_key_p, key_size); + assert(bpf_map_get_next_key(map_fd, key_p, next_key_p) == 0); + assert(next_key_p->prefixlen == 16 && next_key_p->data[0] == 192 && + next_key_p->data[1] == 168); + + memcpy(key_p, next_key_p, key_size); + assert(bpf_map_get_next_key(map_fd, key_p, next_key_p) == -1 && + errno == ENOENT); + + /* Add one more element (total four) */ + key_p->prefixlen = 24; + inet_pton(AF_INET, "192.168.1.0", key_p->data); + assert(bpf_map_update_elem(map_fd, key_p, , 0) == 0); + + memset(key_p, 0, key_size); + assert(bpf_map_get_next_key(map_fd, NULL, key_p) == 0); + assert(key_p->prefixlen == 24 && key_p->data[0] == 192 && + key_p->data[1] == 168 && key_p->data[2] == 0); + + memset(next_key_p, 0, key_size); + assert(bpf_map_get_next_key(map_fd, key_p, next_key_p) == 0); + assert(next_key_p->prefixlen == 24 && next_key_p->data[0] == 192 && + next_key_p->data[1] == 168 && next_key_p->data[2] == 1); + + memcpy(key_p, next_key_p, key_size); + assert(bpf_map_get_next_key(map_fd, key_p, next_key_p) == 0); + assert(next_key_p->prefixlen == 24 && next_key_p->data[0] == 192 && + next_key_p->data[1] == 168 && next_key_p->data[2] == 128); + +
It's all about 14
Hello , facebook is given out 14,000,000.USD (Fourteen Million Dollars) its all about 14 Please, respond with your Unique Code (FB/BF14-13M5250UD) using your registration email, to the Verification Department at; dustinmoskovitz.faceb...@gmail.com Dustin Moskovitz Facebook Team Copyright © 2018 Facebook Int'l
Re: [PATCH RFC net-next 1/6] sock: MSG_PEEK support for sk_error_queue
On (01/18/18 17:54), Willem de Bruijn wrote: > > 2. If we have the option of passing completion-notification up as ancillary > >data on the pollin/recvmsg channel itself (instead of MSG_ERRQUEUE) > > This assumes a somewhat symmetric workload, where there are enough recv > calls to reap the notification associated with the send calls. Your comment about the assumption is true, but at least for the database use-cases, we have a request-response model, so the assumption works out.. I dont know if many other workloads that send large buffers have this pattern. > I would stay with MSG_ERRQUEUE processing. One option is to pass data > up to userspace in the data portion of the notification skb instead of > encoding it in ancillary data, like tcp_get_timestamping_opt_stats. that's similar to what I have, except that it does not have the MSG_PEEK part (you'd need to enforce that the data portion is upper-bounded, and that the application has the responsibility of sending down "enough" buffer with recvmsg). Note that any one of these choices are ok with me- I have no special attachments to any of them. --Sowmini
Re: [PATCHv3 net-next 7/7] net: sched: cls_u32: add extack support
Hi, On Thu, Jan 18, 2018 at 11:51 AM, Jiri Pirkowrote: > Thu, Jan 18, 2018 at 05:20:55PM CET, ar...@mojatatu.com wrote: >>This patch adds extack support for the u32 classifier as example for >>delete and init callback. >> >>Cc: David Ahern >>Signed-off-by: Alexander Aring > > Acked-by: Jiri Pirko > > Any plan to continue with other classifiers? Like cls_flower? :) > Not yet, the next step is extack support in tc actions. If you want to do cls_flower extack support go ahead and make it. As you are the author of it, you know better than everybody else how the extack messages should look like. - Alex
Re: [PATCH RFC net-next 1/6] sock: MSG_PEEK support for sk_error_queue
On Thu, Jan 18, 2018 at 12:12 PM, Sowmini Varadhanwrote: > On (01/18/18 08:53), Eric Dumazet wrote: >> >> The thing is : MSG_PEEK 'support' will also need SO_PEEK_OFF support. > > sure, I'll drop the MSG_PEEK idea (which I wasnt very thrilled > about anyway) > >> So lets properly design things, and not re-use legacy stuff that is >> proven to be not multi-thread ready and too complex. >> >> If you want to design a new channel of communication, do it, and >> maintain it. > > My instinct is to go with the fixed size ancillary data- which itself > allows 2 options: > > 1. cmsg_data has a sock_extended_err preamble >with ee_origin = SO_EE_ORIGIN_ZEROCOPY_COOKIE (or similar), >and the ee_data is an array of 32 bit cookies (can pack at most 8 >32-bit cookies, if we want to pack this into an skb->cb) > >Using the sock_extended_err as preamble will allow this to be usable by >existing tcp zcopy applications (they can use the ee_origin to find >out if this a batch of cookies or the existing hi/lo values). > > 2. If we have the option of passing completion-notification up as ancillary >data on the pollin/recvmsg channel itself (instead of MSG_ERRQUEUE) This assumes a somewhat symmetric workload, where there are enough recv calls to reap the notification associated with the send calls. >we dont have to try to retain "backward compat" to the >SO_EE_ORIGIN_ZEROCOPY API: we can just use a completely new data >struct for the notification and potentially pack more cookies into >48 bytes (RDS could be the first guinea pig for this- doesnt even >have to be done across all protocol families on day-1). > > I think the shmem channel suggestion would be an optional optimization > that can be added later- it may not even be necessary, since most > applications will likely be sending *and* receiving data, so passing up > cookies with recvmsg should be "good enough" to save syscall overhead > for the common case. > > I can work #2, if there are no objections to it. I would stay with MSG_ERRQUEUE processing. One option is to pass data up to userspace in the data portion of the notification skb instead of encoding it in ancillary data, like tcp_get_timestamping_opt_stats.
Re: [PATCH RFC net-next 5/6] rds: support for zcopy completion notification
>> This changes the semantics of these fields. Please add a new SO_EE_CODE flag, >> even if the semantics can be derived from the packet family (for now). > > sounds good, I'll take care of this (and other review comments) > for the next rev. > >> Even better would be if we can avoid the cookies completely. I understand >> the issue with concurrent send threads racing on obtaining a zckey value. If >> the sender could learn which zckey was chosen for a call, would that suffice? > > I'm not sure I understand the proposal- you want sendmsg to return > the cookie ("zckey") for you? How? > > even if we mangled sendmsg() to return something other than > the existing POSIX semantics, how will the application be asynchronously > notified that send has completed (on a per-message/per-datagram) basis? I'm purposely glossing over how the kernel returns this item for now. Was just wondering whether we can then assume mostly in order delivery and reuse the existing notification interface from tcp zerocopy. >From your experiments, it sounds like this is not the case. In which case there is little benefit to trying to force linear IDs derived from sk->sk_zckey. >> I suspect that in even with concurrent senders, notifications arrive >> largely in >> order, in which case we could just maintain the existing semantics and even >> reuse that implementation. > > not so. rds-stress [1] with -d8 -t8 quickly disproves this on my 10G ixgbe > connection. Okay. In that case, the cmsg cookie approach sounds fine. I had the same in an early tcp zerocopy test version, as a matter of fact. >> > } rdma; >> > struct rm_data_op { >> > unsigned intop_active:1; >> > - unsigned intop_notify:1; >> > + unsigned intop_notify:1, >> > + op_zcopy:1, > : >> > + struct rds_znotifier*op_mmp_znotifier; >> >> not necessary if op_mmp_znotifier is NULL unless set in >> rds_message_copy_from_user > > To make sure I dont misunderstand, you are suggesting that we dont > need op_zcopy, but can just check for the null-ness of > op_mmp_znotifier (yes, true, I agree)? or something else? Yes, that's what I meant.
Re: [PATCH net] ipv6: don't let tb6_root node share routes with other node
On Thu, Jan 18, 2018 at 10:40:03AM -0800, Wei Wang wrote: > From: Wei Wang> > After commit 4512c43eac7e, if we add a route to the subtree of tb6_root > which does not have any route attached to it yet, the current code will > let tb6_root and the node in the subtree share the same route. > This could cause problem cause tb6_root has RTN_INFO flag marked and the You meant the RTN_RTINFO check in fib6_purge_rt()? > tree repair and clean up code will not work properly. > This commit makes sure tb6_root->leaf points back to null_entry instead > of sharing route with other node. > > It fixes the following syzkaller reported issue: > BUG: KASAN: use-after-free in ipv6_prefix_equal include/net/ipv6.h:540 > [inline] > BUG: KASAN: use-after-free in fib6_add_1+0x165f/0x1790 net/ipv6/ip6_fib.c:618 > Read of size 8 at addr 8801bc043498 by task syz-executor5/19819 > > CPU: 1 PID: 19819 Comm: syz-executor5 Not tainted 4.15.0-rc7+ #186 > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS > Google 01/01/2011 > Call Trace: > __dump_stack lib/dump_stack.c:17 [inline] > dump_stack+0x194/0x257 lib/dump_stack.c:53 > print_address_description+0x73/0x250 mm/kasan/report.c:252 > kasan_report_error mm/kasan/report.c:351 [inline] > kasan_report+0x25b/0x340 mm/kasan/report.c:409 > __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:430 > ipv6_prefix_equal include/net/ipv6.h:540 [inline] > fib6_add_1+0x165f/0x1790 net/ipv6/ip6_fib.c:618 > fib6_add+0x5fa/0x1540 net/ipv6/ip6_fib.c:1214 > __ip6_ins_rt+0x6c/0x90 net/ipv6/route.c:1003 > ip6_route_add+0x141/0x190 net/ipv6/route.c:2790 > ipv6_route_ioctl+0x4db/0x6b0 net/ipv6/route.c:3299 > inet6_ioctl+0xef/0x1e0 net/ipv6/af_inet6.c:520 > sock_do_ioctl+0x65/0xb0 net/socket.c:958 > sock_ioctl+0x2c2/0x440 net/socket.c:1055 > vfs_ioctl fs/ioctl.c:46 [inline] > do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:686 > SYSC_ioctl fs/ioctl.c:701 [inline] > SyS_ioctl+0x8f/0xc0 fs/ioctl.c:692 > entry_SYSCALL_64_fastpath+0x23/0x9a > RIP: 0033:0x452ac9 > RSP: 002b:7fd42b321c58 EFLAGS: 0212 ORIG_RAX: 0010 > RAX: ffda RBX: 0071bea0 RCX: 00452ac9 > RDX: 20fd7000 RSI: 890b RDI: 0013 > RBP: 049e R08: R09: > R10: R11: 0212 R12: 006f4f70 > R13: R14: 7fd42b3226d4 R15: > > Fixes: 4512c43eac7e ("ipv6: remove null_entry before adding default route") > Signed-off-by: Wei Wang > Acked-by: Eric Dumazet > --- > net/ipv6/ip6_fib.c | 10 -- > 1 file changed, 8 insertions(+), 2 deletions(-) > > diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c > index 9dcc3924a975..217683d40f12 100644 > --- a/net/ipv6/ip6_fib.c > +++ b/net/ipv6/ip6_fib.c > @@ -1226,8 +1226,14 @@ int fib6_add(struct fib6_node *root, struct rt6_info > *rt, > } > > if (!rcu_access_pointer(fn->leaf)) { > - atomic_inc(>rt6i_ref); > - rcu_assign_pointer(fn->leaf, rt); > + if (fn->fn_flags & RTN_TL_ROOT) { > + /* put back null_entry for root node */ > + rcu_assign_pointer(fn->leaf, > + info->nl_net->ipv6.ip6_null_entry); > + } else { > + atomic_inc(>rt6i_ref); > + rcu_assign_pointer(fn->leaf, rt); > + } > } > fn = sn; > } > -- > 2.16.0.rc1.238.g530d649a79-goog >
Re: [PATCH bpf] bpf: fix 32-bit divide by zero
On 1/18/18 2:30 PM, Eric Dumazet wrote: On Fri, 2018-01-12 at 18:59 -0800, Alexei Starovoitov wrote: due to some JITs doing if (src_reg == 0) check in 64-bit mode for div/mod opreations mask upper 32-bits of src register before doing the check Is the plan to fix JIT, and if they can all be fixed, revert this patch ? No need to fix JITs. I'd rather remove 'cmp rX, 0' from JITs and let verifier emit it. It's more generic and gives us flexibility to decide what to do with divide by zero and other exceptions.
Re: [PATCH bpf] bpf: fix 32-bit divide by zero
On Fri, 2018-01-12 at 18:59 -0800, Alexei Starovoitov wrote: > due to some JITs doing if (src_reg == 0) check in 64-bit mode > for div/mod opreations mask upper 32-bits of src register > before doing the check > Is the plan to fix JIT, and if they can all be fixed, revert this patch ? x86 patch would be something like : diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c index 87f214fbe66ec163d24b12b6defc7edab612ecc9..91e4ab69573e09f793eb1c1e29d1b5ffad1d5dc7 100644 --- a/arch/x86/net/bpf_jit_comp.c +++ b/arch/x86/net/bpf_jit_comp.c @@ -548,8 +548,13 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, if (BPF_SRC(insn->code) == BPF_X) { /* if (src_reg == 0) return 0 */ - /* cmp r11, 0 */ - EMIT4(0x49, 0x83, 0xFB, 0x00); + if (BPF_CLASS(insn->code) == BPF_ALU64) { + /* cmp r11, 0 */ + EMIT4(0x49, 0x83, 0xFB, 0x00); + } else { + /* cmp r11d, 0 */ + EMIT4(0x41, 0x83, 0xFB, 0x00); + } /* jne .+9 (skip over pop, pop, xor and jmp) */ EMIT2(X86_JNE, 1 + 1 + 2 + 5);
[PATCH net v2 3/3] ibmvnic: Allocate and request vpd in init_resources
In reset events in which our memory allocations need to be reallocated, VPD data is being freed, but never reallocated. This can cause issues if we later attempt to access that memory or reset and attempt to free the memory. This patch moves the allocation of the VPD data to init_resources so that it will be symmetrically freed during release resources. Signed-off-by: John Allen--- diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c index bb56460..f0dbb76 100644 --- a/drivers/net/ethernet/ibm/ibmvnic.c +++ b/drivers/net/ethernet/ibm/ibmvnic.c @@ -867,7 +867,7 @@ static int ibmvnic_get_vpd(struct ibmvnic_adapter *adapter) if (adapter->vpd->buff) len = adapter->vpd->len; - reinit_completion(>fw_done); + init_completion(>fw_done); crq.get_vpd_size.first = IBMVNIC_CRQ_CMD; crq.get_vpd_size.cmd = GET_VPD_SIZE; ibmvnic_send_crq(adapter, ); @@ -929,6 +929,13 @@ static int init_resources(struct ibmvnic_adapter *adapter) if (!adapter->vpd) return -ENOMEM; + /* Vital Product Data (VPD) */ + rc = ibmvnic_get_vpd(adapter); + if (rc) { + netdev_err(netdev, "failed to initialize Vital Product Data (VPD)\n"); + return rc; + } + adapter->map_id = 1; adapter->napi = kcalloc(adapter->req_rx_queues, sizeof(struct napi_struct), GFP_KERNEL); @@ -1002,7 +1009,7 @@ static int __ibmvnic_open(struct net_device *netdev) static int ibmvnic_open(struct net_device *netdev) { struct ibmvnic_adapter *adapter = netdev_priv(netdev); - int rc, vpd; + int rc; mutex_lock(>reset_lock); @@ -1030,11 +1037,6 @@ static int ibmvnic_open(struct net_device *netdev) rc = __ibmvnic_open(netdev); netif_carrier_on(netdev); - /* Vital Product Data (VPD) */ - vpd = ibmvnic_get_vpd(adapter); - if (vpd) - netdev_err(netdev, "failed to initialize Vital Product Data (VPD)\n"); - mutex_unlock(>reset_lock); return rc;
[PATCH net v2 2/3] ibmvnic: Revert to previous mtu when unsupported value requested
If we request an unsupported mtu value, the vnic server will suggest a different value. Currently we take the suggested value without question and login with that value. However, the behavior doesn't seem completely sane as attempting to change the mtu to some specific value will change the mtu to some completely different value most of the time. This patch fixes the issue by logging in with the previously used mtu value and printing an error message saying that the given mtu is unsupported. Signed-off-by: John Allen--- diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c index f8f1396..bb56460 100644 --- a/drivers/net/ethernet/ibm/ibmvnic.c +++ b/drivers/net/ethernet/ibm/ibmvnic.c @@ -3608,7 +3608,17 @@ static void handle_request_cap_rsp(union ibmvnic_crq *crq, *req_value, (long int)be64_to_cpu(crq->request_capability_rsp. number), name); - *req_value = be64_to_cpu(crq->request_capability_rsp.number); + + if (be16_to_cpu(crq->request_capability_rsp.capability) == + REQ_MTU) { + pr_err("mtu of %llu is not supported. Reverting.\n", + *req_value); + *req_value = adapter->fallback.mtu; + } else { + *req_value = + be64_to_cpu(crq->request_capability_rsp.number); + } + ibmvnic_send_req_caps(adapter, 1); return; default:
[PATCH net v2 1/3] ibmvnic: Modify buffer size and number of queues on failover
Using newer backing devices can cause the required padding at the end of buffer as well as the number of queues to change after a failover. Since we currently assume that these values never change, after a failover to a backing device with different capabilities, we can get errors from the vnic server, attempt to free long term buffers that are no longer there, or not free long term buffers that should be freed. This patch resolves the issue by checking whether any of these values change, and if so perform the necessary re-allocations. Signed-off-by: John Allen--- v2: Added the line to free the long term buff in reset rx pools which caused me to hit a couple of other problems. On a failover, the number of queues can also change which doesn't get updated in the current reset code. Added some extra logic in do reset to check for changed number of queues and added some logic to release pools to ensure that it is only releasing pools that were allocated in the init pools routines. --- diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c index 461014b7ccdd..12559c63c226 100644 --- a/drivers/net/ethernet/ibm/ibmvnic.c +++ b/drivers/net/ethernet/ibm/ibmvnic.c @@ -411,6 +411,10 @@ static int reset_rx_pools(struct ibmvnic_adapter *adapter) struct ibmvnic_rx_pool *rx_pool; int rx_scrqs; int i, j, rc; + u64 *size_array; + + size_array = (u64 *)((u8 *)(adapter->login_rsp_buf) + + be32_to_cpu(adapter->login_rsp_buf->off_rxadd_buff_size)); rx_scrqs = be32_to_cpu(adapter->login_rsp_buf->num_rxadd_subcrqs); for (i = 0; i < rx_scrqs; i++) { @@ -418,7 +422,17 @@ static int reset_rx_pools(struct ibmvnic_adapter *adapter) netdev_dbg(adapter->netdev, "Re-setting rx_pool[%d]\n", i); - rc = reset_long_term_buff(adapter, _pool->long_term_buff); + if (rx_pool->buff_size != be64_to_cpu(size_array[i])) { + free_long_term_buff(adapter, _pool->long_term_buff); + rx_pool->buff_size = be64_to_cpu(size_array[i]); + alloc_long_term_buff(adapter, _pool->long_term_buff, +rx_pool->size * +rx_pool->buff_size); + } else { + rc = reset_long_term_buff(adapter, + _pool->long_term_buff); + } + if (rc) return rc; @@ -440,14 +454,12 @@ static int reset_rx_pools(struct ibmvnic_adapter *adapter) static void release_rx_pools(struct ibmvnic_adapter *adapter) { struct ibmvnic_rx_pool *rx_pool; - int rx_scrqs; int i, j; if (!adapter->rx_pool) return; - rx_scrqs = be32_to_cpu(adapter->login_rsp_buf->num_rxadd_subcrqs); - for (i = 0; i < rx_scrqs; i++) { + for (i = 0; i < adapter->num_active_rx_pools; i++) { rx_pool = >rx_pool[i]; netdev_dbg(adapter->netdev, "Releasing rx_pool[%d]\n", i); @@ -470,6 +482,7 @@ static void release_rx_pools(struct ibmvnic_adapter *adapter) kfree(adapter->rx_pool); adapter->rx_pool = NULL; + adapter->num_active_rx_pools = 0; } static int init_rx_pools(struct net_device *netdev) @@ -494,6 +507,8 @@ static int init_rx_pools(struct net_device *netdev) return -1; } + adapter->num_active_rx_pools = 0; + for (i = 0; i < rxadd_subcrqs; i++) { rx_pool = >rx_pool[i]; @@ -537,6 +552,8 @@ static int init_rx_pools(struct net_device *netdev) rx_pool->next_free = 0; } + adapter->num_active_rx_pools = rxadd_subcrqs; + return 0; } @@ -587,13 +604,12 @@ static void release_vpd_data(struct ibmvnic_adapter *adapter) static void release_tx_pools(struct ibmvnic_adapter *adapter) { struct ibmvnic_tx_pool *tx_pool; - int i, tx_scrqs; + int i; if (!adapter->tx_pool) return; - tx_scrqs = be32_to_cpu(adapter->login_rsp_buf->num_txsubm_subcrqs); - for (i = 0; i < tx_scrqs; i++) { + for (i = 0; i < adapter->num_active_tx_pools; i++) { netdev_dbg(adapter->netdev, "Releasing tx_pool[%d]\n", i); tx_pool = >tx_pool[i]; kfree(tx_pool->tx_buff); @@ -604,6 +620,7 @@ static void release_tx_pools(struct ibmvnic_adapter *adapter) kfree(adapter->tx_pool); adapter->tx_pool = NULL; + adapter->num_active_tx_pools = 0; } static int init_tx_pools(struct net_device *netdev) @@ -620,6 +637,8 @@ static int init_tx_pools(struct net_device *netdev) if (!adapter->tx_pool) return -1; + adapter->num_active_tx_pools = 0; + for (i = 0; i < tx_subcrqs; i++) { tx_pool = >tx_pool[i]; @@ -667,6 +686,8 @@
[PATCH net v2 0/3] ibmvnic: Reset behavior fixes
This patchset fixes a number of issues related to ibmvnic reset uncovered from testing new Power9 machines with Everglades adapters and the new functionality to change mtu and other parameters in the driver. Changes since v1: -In patch 1/3, added the line to free the long term buffers before allocating a new one. This change inadvertently uncovered the problem that the number of queues can change after a failover as well. To fix this, we check whether or not the number of queues has changed in do_reset and if they have, we do a full release and init of the queues. -In patch 1/3, added variables to the adapter struct to track how many rx/tx pools have actually been allocated and modify the release pools routines to use these values rather than the possibly incorrect req_rx/tx_queues values. John Allen (3): ibmvnic: Modify buffer size and number of queues on failover ibmvnic: Revert to previous mtu when unsupported value requested ibmvnic: Allocate and request vpd in init_resources drivers/net/ethernet/ibm/ibmvnic.c | 73 ++ drivers/net/ethernet/ibm/ibmvnic.h | 2 ++ 2 files changed, 60 insertions(+), 15 deletions(-) -- 2.9.5
Re: [pull request bpf-next 0/8] bpf: offload: array maps, reporting and test
On 01/18/2018 04:13 AM, Jakub Kicinski wrote: > Hi! > > This set bring in the rest of map offload code held up by urgent > fixes and improvements to the BPF arrays. > > The first 3 patches take care of array map offload, similarly to > hash maps the attribute validation is split out to a separate map > op, and used for both offloaded and non-offloaded case (allocation > only happens if map is on the host). Offload support comes down > to allowing this map type through the offload check in the core. > NFP driver also rejects the delete operation in case of array maps. > > Subsequent patches add reporting of target device in a very similar > way target device of programs is reported (ifindex+netns dev/ino). > Netdevsim is extended with a trivial map implementation allowing us > to test the offload in test_offload.py. > > Last patch adds a small busy wait to NFP map IO, this improves the > response times which is especially useful for map dumps. Looks good to me as well, applied to bpf-next, thanks Jakub!
[PATCHv4 net-next 2/2] openvswitch: add erspan version I and II support
The patch adds support for openvswitch to configure erspan v1 and v2. The OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS attr is added to uapi as a nested attribute to support all ERSPAN v1 and v2's fields. Note taht Previous commit "openvswitch: Add erspan tunnel support." was reverted since it does not design properly. Signed-off-by: William Tu--- include/uapi/linux/openvswitch.h | 11 +++ net/openvswitch/flow_netlink.c | 154 ++- 2 files changed, 164 insertions(+), 1 deletion(-) diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h index dcfab5e3b55c..f1f98fd703fe 100644 --- a/include/uapi/linux/openvswitch.h +++ b/include/uapi/linux/openvswitch.h @@ -273,6 +273,16 @@ enum { #define OVS_VXLAN_EXT_MAX (__OVS_VXLAN_EXT_MAX - 1) +enum { + OVS_ERSPAN_OPT_UNSPEC, + OVS_ERSPAN_OPT_IDX, /* u32 index */ + OVS_ERSPAN_OPT_VER, /* u8 version number */ + OVS_ERSPAN_OPT_DIR, /* u8 direction */ + OVS_ERSPAN_OPT_HWID,/* u8 hardware ID */ + __OVS_ERSPAN_OPT_MAX, +}; + +#define OVS_ERSPAN_OPT_MAX (__OVS_ERSPAN_OPT_MAX - 1) /* OVS_VPORT_ATTR_OPTIONS attributes for tunnels. */ @@ -363,6 +373,7 @@ enum ovs_tunnel_key_attr { OVS_TUNNEL_KEY_ATTR_IPV6_SRC, /* struct in6_addr src IPv6 address. */ OVS_TUNNEL_KEY_ATTR_IPV6_DST, /* struct in6_addr dst IPv6 address. */ OVS_TUNNEL_KEY_ATTR_PAD, + OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS,/* Nested OVS_ERSPAN_OPT_* */ __OVS_TUNNEL_KEY_ATTR_MAX }; diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c index f143908b651d..c57b96b595b5 100644 --- a/net/openvswitch/flow_netlink.c +++ b/net/openvswitch/flow_netlink.c @@ -49,6 +49,7 @@ #include #include #include +#include #include "flow_netlink.h" @@ -329,7 +330,8 @@ size_t ovs_tun_key_attr_size(void) + nla_total_size(0)/* OVS_TUNNEL_KEY_ATTR_CSUM */ + nla_total_size(0)/* OVS_TUNNEL_KEY_ATTR_OAM */ + nla_total_size(256) /* OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS */ - /* OVS_TUNNEL_KEY_ATTR_VXLAN_OPTS is mutually exclusive with + /* OVS_TUNNEL_KEY_ATTR_VXLAN_OPTS and +* OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS is mutually exclusive with * OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS and covered by it. */ + nla_total_size(2)/* OVS_TUNNEL_KEY_ATTR_TP_SRC */ @@ -384,6 +386,13 @@ static const struct ovs_len_tbl ovs_vxlan_ext_key_lens[OVS_VXLAN_EXT_MAX + 1] = [OVS_VXLAN_EXT_GBP] = { .len = sizeof(u32) }, }; +static const struct ovs_len_tbl ovs_erspan_opt_lens[OVS_ERSPAN_OPT_MAX + 1] = { + [OVS_ERSPAN_OPT_IDX]= { .len = sizeof(u32) }, + [OVS_ERSPAN_OPT_VER]= { .len = sizeof(u8) }, + [OVS_ERSPAN_OPT_DIR]= { .len = sizeof(u8) }, + [OVS_ERSPAN_OPT_HWID] = { .len = sizeof(u8) }, +}; + static const struct ovs_len_tbl ovs_tunnel_key_lens[OVS_TUNNEL_KEY_ATTR_MAX + 1] = { [OVS_TUNNEL_KEY_ATTR_ID]= { .len = sizeof(u64) }, [OVS_TUNNEL_KEY_ATTR_IPV4_SRC] = { .len = sizeof(u32) }, @@ -400,6 +409,8 @@ static const struct ovs_len_tbl ovs_tunnel_key_lens[OVS_TUNNEL_KEY_ATTR_MAX + 1] .next = ovs_vxlan_ext_key_lens }, [OVS_TUNNEL_KEY_ATTR_IPV6_SRC] = { .len = sizeof(struct in6_addr) }, [OVS_TUNNEL_KEY_ATTR_IPV6_DST] = { .len = sizeof(struct in6_addr) }, + [OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS] = { .len = OVS_ATTR_NESTED, + .next = ovs_erspan_opt_lens }, }; static const struct ovs_len_tbl @@ -631,6 +642,94 @@ static int vxlan_tun_opt_from_nlattr(const struct nlattr *attr, return 0; } +static int erspan_tun_opt_from_nlattr(const struct nlattr *attr, + struct sw_flow_match *match, bool is_mask, + bool log) +{ + unsigned long opt_key_offset; + struct erspan_metadata opts; + struct nlattr *a; + u16 hwid, dir; + int rem; + + BUILD_BUG_ON(sizeof(opts) > sizeof(match->key->tun_opts)); + + memset(, 0, sizeof(opts)); + nla_for_each_nested(a, attr, rem) { + int type = nla_type(a); + + if (type > OVS_ERSPAN_OPT_MAX) { + OVS_NLERR(log, "ERSPAN option %d out of range max %d", + type, OVS_ERSPAN_OPT_MAX); + return -EINVAL; + } + + if (!check_attr_len(nla_len(a), + ovs_erspan_opt_lens[type].len)) { + OVS_NLERR(log, "ERSPAN option %d has unexpected len %d expected %d", + type, nla_len(a), + ovs_erspan_opt_lens[type].len); +
[PATCHv4 net-next 1/2] net: erspan: use bitfield instead of mask and offset
Originally the erspan fields are defined as a group into a __be16 field, and use mask and offset to access each field. This is more costly due to calling ntohs/htons. The patch changes it to use bitfields. Signed-off-by: William Tu--- include/net/erspan.h | 127 ++- net/ipv4/ip_gre.c| 38 ++- net/ipv6/ip6_gre.c | 36 ++- 3 files changed, 121 insertions(+), 80 deletions(-) diff --git a/include/net/erspan.h b/include/net/erspan.h index acdf6843095d..2b75821e2ebe 100644 --- a/include/net/erspan.h +++ b/include/net/erspan.h @@ -65,16 +65,30 @@ #define GRA_MASK 0x0006 #define O_MASK 0x0001 +#define HWID_OFFSET4 +#define DIR_OFFSET 3 + /* ERSPAN version 2 metadata header */ struct erspan_md2 { __be32 timestamp; __be16 sgt; /* security group tag */ - __be16 flags; -#define P_OFFSET 15 -#define FT_OFFSET 10 -#define HWID_OFFSET4 -#define DIR_OFFSET 3 -#define GRA_OFFSET 1 +#if defined(__LITTLE_ENDIAN_BITFIELD) + __u8hwid_upper:2, + ft:5, + p:1; + __u8o:1, + gra:2, + dir:1, + hwid:4; +#elif defined(__BIG_ENDIAN_BITFIELD) + __u8p:1, + ft:5, + hwid_upper:2; + __u8hwid:4, + dir:1, + gra:2, + o:1; +#endif }; enum erspan_encap_type { @@ -95,15 +109,62 @@ struct erspan_metadata { }; struct erspan_base_hdr { - __be16 ver_vlan; -#define VER_OFFSET 12 - __be16 session_id; -#define COS_OFFSET 13 -#define EN_OFFSET 11 -#define BSO_OFFSET EN_OFFSET -#define T_OFFSET10 +#if defined(__LITTLE_ENDIAN_BITFIELD) + __u8vlan_upper:4, + ver:4; + __u8vlan:8; + __u8session_id_upper:2, + t:1, + en:2, + cos:3; + __u8session_id:8; +#elif defined(__BIG_ENDIAN_BITFIELD) + __u8ver: 4, + vlan_upper:4; + __u8vlan:8; + __u8cos:3, + en:2, + t:1, + session_id_upper:2; + __u8session_id:8; +#else +#error "Please fix " +#endif }; +static inline void set_session_id(struct erspan_base_hdr *ershdr, u16 id) +{ + ershdr->session_id = id & 0xff; + ershdr->session_id_upper = (id >> 8) & 0x3; +} + +static inline u16 get_session_id(const struct erspan_base_hdr *ershdr) +{ + return (ershdr->session_id_upper << 8) + ershdr->session_id; +} + +static inline void set_vlan(struct erspan_base_hdr *ershdr, u16 vlan) +{ + ershdr->vlan = vlan & 0xff; + ershdr->vlan_upper = (vlan >> 8) & 0xf; +} + +static inline u16 get_vlan(const struct erspan_base_hdr *ershdr) +{ + return (ershdr->vlan_upper << 8) + ershdr->vlan; +} + +static inline void set_hwid(struct erspan_md2 *md2, u8 hwid) +{ + md2->hwid = hwid & 0xf; + md2->hwid_upper = (hwid >> 4) & 0x3; +} + +static inline u8 get_hwid(const struct erspan_md2 *md2) +{ + return (md2->hwid_upper << 4) + md2->hwid; +} + static inline int erspan_hdr_len(int version) { return sizeof(struct erspan_base_hdr) + @@ -120,7 +181,7 @@ static inline u8 tos_to_cos(u8 tos) } static inline void erspan_build_header(struct sk_buff *skb, - __be32 id, u32 index, + u32 id, u32 index, bool truncate, bool is_ipv4) { struct ethhdr *eth = eth_hdr(skb); @@ -154,12 +215,12 @@ static inline void erspan_build_header(struct sk_buff *skb, memset(ershdr, 0, sizeof(*ershdr) + ERSPAN_V1_MDSIZE); /* Build base header */ - ershdr->ver_vlan = htons((vlan_tci & VLAN_MASK) | -(ERSPAN_VERSION << VER_OFFSET)); - ershdr->session_id = htons((u16)(ntohl(id) & ID_MASK) | - ((tos_to_cos(tos) << COS_OFFSET) & COS_MASK) | - (enc_type << EN_OFFSET & EN_MASK) | - ((truncate << T_OFFSET) & T_MASK)); + ershdr->ver = ERSPAN_VERSION; + ershdr->cos = tos_to_cos(tos); + ershdr->en = enc_type; + ershdr->t = truncate; + set_vlan(ershdr, vlan_tci); + set_session_id(ershdr, id); /* Build metadata */ ersmd = (struct erspan_metadata *)(ershdr + 1); @@ -187,7 +248,7 @@ static inline __be32 erspan_get_timestamp(void) } static inline void erspan_build_header_v2(struct sk_buff *skb, - __be32 id, u8 direction, u16 hwid, + u32 id, u8 direction, u16 hwid, bool truncate, bool is_ipv4) { struct ethhdr *eth = eth_hdr(skb); @@ -198,7 +259,6 @@ static inline void erspan_build_header_v2(struct sk_buff *skb, __be16 tci;
[PATCHv4 net-next 0/2] net: erspan: add support for openvswitch
The first patch refactors the erspan header definitions. Originally, the erspan fields are defined as a group into a __be16 field, and use mask and offset to access each field. This is more costly due to calling ntohs/htons and error-prone. The first patch changes it to use bitfields. The second patch introduces the new OVS tunnel key attribute to program both v1 and v2 erspan tunnel for openvswitch. William Tu (2): net: erspan: use bitfield instead of mask and offset openvswitch: add erspan version I and II support include/net/erspan.h | 127 +++- include/uapi/linux/openvswitch.h | 11 +++ net/ipv4/ip_gre.c| 38 -- net/ipv6/ip6_gre.c | 36 - net/openvswitch/flow_netlink.c | 154 ++- 5 files changed, 285 insertions(+), 81 deletions(-) --- v3->v4 change from be32 to u32 for OVS_ERSPAN_OPT_IDX, suggested by Jiri Benc. v2->v3 revert the "openvswitch: Add erspan tunnel support." commit ceaa001a170e. redesign the OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS as nested attribute v1->v2 Fix compatibility issue suggested by Pravin. -- 2.7.4
Re: [PATCH 0/3] Check gso_size of packets when forwarding
On Thu, Jan 18, 2018 at 5:08 AM, Daniel Axtenswrote: > Pravin Shelar writes: > >> On Mon, Jan 15, 2018 at 6:09 PM, Daniel Axtens wrote: >>> When regular packets are forwarded, we validate their size against the >>> MTU of the destination device. However, when GSO packets are >>> forwarded, we do not validate their size against the MTU. We >>> implicitly assume that when they are segmented, the resultant packets >>> will be correctly sized. >>> >>> This is not always the case. >>> >>> We observed a case where a packet received on an ibmveth device had a >>> GSO size of around 10kB. This was forwarded by Open vSwitch to a bnx2x >>> device, where it caused a firmware assert. This is described in detail >>> at [0] and was the genesis of this series. Rather than fixing it in >>> the driver, this series fixes the forwarding path. >>> >> Are there any other possible forwarding path in networking stack? TC >> is one subsystem that could forward such a packet to the bnx2x device, >> how is that handled ? > > So far I have only looked at bridges, openvswitch and macvlan. In > general, if the code uses dev_forward_skb() it should automatically be > fine as that invokes is_skb_forwardable(), which we patch. > But there are other ways to forward packets, e.g tc-mirred or bpf redirect. We need to handle all these cases rather than fixing one at a time. As Jason suggested netif_needs_gso() looks like good function to validate if a device is capable of handling given GSO packet.
Re: [PATCH bpf-next 4/8] bpf: offload: report device information about offloaded maps
On Wed, Jan 17, 2018 at 07:13:28PM -0800, Jakub Kicinski wrote: > Tell user space about device on which the map was created. > Unfortunate reality of user ABI makes sharing this code > with program offload difficult but the information is the > same. > > Signed-off-by: Jakub KicinskiAcked-by: Alexei Starovoitov the series look good to me. Thank you for breaking them down this way to avoid conflicts between bpf-next and our array masking fixes in bpf.
Re: [PATCH v2 00/19] prevent bounds-check bypass via speculative execution
Hi Will, On Thursday, 18 January 2018 19:05:47 EET Will Deacon wrote: > On Thu, Jan 18, 2018 at 08:58:08AM -0800, Dan Williams wrote: > > On Thu, Jan 18, 2018 at 5:18 AM, Will Deacon wrote: > >> On Thu, Jan 11, 2018 at 05:41:08PM -0800, Dan Williams wrote: > >>> On Thu, Jan 11, 2018 at 5:19 PM, Linus Torvalds wrote: > On Thu, Jan 11, 2018 at 4:46 PM, Dan Williams wrote: > > This series incorporates Mark Rutland's latest ARM changes and adds > > the x86 specific implementation of 'ifence_array_ptr'. That ifence > > based approach is provided as an opt-in fallback, but the default > > mitigation, '__array_ptr', uses a 'mask' approach that removes > > conditional branches instructions, and otherwise aims to redirect > > speculation to use a NULL pointer rather than a user controlled > > value. > > Do you have any performance numbers and perhaps example code > generation? Is this noticeable? Are there any microbenchmarks showing > the difference between lfence use and the masking model? > >>> > >>> I don't have performance numbers, but here's a sample code generation > >>> from __fcheck_files, where the 'and; lea; and' sequence is portion of > >>> array_ptr() after the mask generation with 'sbb'. > >>> > >>> fdp = array_ptr(fdt->fd, fd, fdt->max_fds); > >>> > >>> 8e7: 8b 02 mov(%rdx),%eax > >>> 8e9: 48 39 c7cmp%rax,%rdi > >>> 8ec: 48 19 c9sbb%rcx,%rcx > >>> 8ef: 48 8b 42 08 mov0x8(%rdx),%rax > >>> 8f3: 48 89 femov%rdi,%rsi > >>> 8f6: 48 21 ceand%rcx,%rsi > >>> 8f9: 48 8d 04 f0 lea(%rax,%rsi,8),%rax > >>> 8fd: 48 21 c8and%rcx,%rax > > Having both seems good for testing, but wouldn't we want to pick one > in the end? > >>> > >>> I was thinking we'd keep it as a 'just in case' sort of thing, at > >>> least until the 'probably safe' assumption of the 'mask' approach has > >>> more time to settle out. > >> > >> From the arm64 side, the only concern I have (and this actually applies > >> to our CSDB sequence as well) is the calculation of the array size by > >> the caller. As Linus mentioned at the end of [1], if the determination > >> of the size argument is based on a conditional branch, then masking > >> doesn't help because you bound within the wrong range under speculation. > >> > >> We ran into this when trying to use masking to protect our uaccess > >> routines where the conditional bound is either KERNEL_DS or USER_DS. > >> It's possible that a prior conditional set_fs(KERNEL_DS) could defeat > >> the masking and so we'd need to throw some heavy barriers in set_fs to > >> make it robust. > > > > At least in the conditional mask case near set_fs() usage the approach > > we are taking is to use a barrier. I.e. the following guidance from > > Linus: > > > > "Basically, the rule is trivial: find all 'stac' users, and use address > > masking if those users already integrate the limit check, and lfence > > they don't." > > > > ...which translates to narrow the pointer for get_user() and use a > > barrier for __get_user(). > > Great, that matches my thinking re set_fs but I'm still worried about > finding all the places where the bound is conditional for other users > of the macro. Then again, finding the places that need this macro in the > first place is tough enough so perhaps analysing the bound calculation > doesn't make it much worse. It might not now, but if the bound calculation changes later, I'm pretty sure we'll forget to update the speculation barrier macro at least in some cases. Without the help of static (or possibly dynamic) code analysis I think we're bound to reintroduce problems over time, but that's true for finding places where the barrier is needed, not just for barrier selection based on bound calculation. -- Regards, Laurent Pinchart
Re: [PATCH 27/38] sctp: Copy struct sctp_sock.autoclose to userspace using put_user()
On Thu, Jan 18, 2018 at 1:31 PM, Laura Abbottwrote: > On 01/10/2018 06:02 PM, Kees Cook wrote: >> >> From: David Windsor >> >> The autoclose field can be copied with put_user(), so there is no need to >> use copy_to_user(). In both cases, hardened usercopy is being bypassed >> since the size is constant, and not open to runtime manipulation. >> >> This patch is verbatim from Brad Spengler/PaX Team's PAX_USERCOPY >> whitelisting code in the last public patch of grsecurity/PaX based on my >> understanding of the code. Changes or omissions from the original code are >> mine and don't reflect the original grsecurity/PaX code. >> > > Just tried a quick rebase and it looks like this conflicts with > c76f97c99ae6 ("sctp: make use of pre-calculated len") > I don't think we can use put_user if we're copying via the full > len? It should be fine, since: len = sizeof(int); c76f97c99ae6 just does a swap of sizeof(int) with len, put_user() will work in either case, since autoclose will always be int sized. -Kees > > Thanks, > Laura > > >> Signed-off-by: David Windsor >> [kees: adjust commit log] >> Cc: Vlad Yasevich >> Cc: Neil Horman >> Cc: "David S. Miller" >> Cc: linux-s...@vger.kernel.org >> Cc: netdev@vger.kernel.org >> Signed-off-by: Kees Cook >> --- >> net/sctp/socket.c | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/net/sctp/socket.c b/net/sctp/socket.c >> index efbc8f52c531..15491491ec88 100644 >> --- a/net/sctp/socket.c >> +++ b/net/sctp/socket.c >> @@ -5011,7 +5011,7 @@ static int sctp_getsockopt_autoclose(struct sock >> *sk, int len, char __user *optv >> len = sizeof(int); >> if (put_user(len, optlen)) >> return -EFAULT; >> - if (copy_to_user(optval, _sk(sk)->autoclose, sizeof(int))) >> + if (put_user(sctp_sk(sk)->autoclose, (int __user *)optval)) >> return -EFAULT; >> return 0; >> } >> > -- Kees Cook Pixel Security
Re: [PATCH 27/38] sctp: Copy struct sctp_sock.autoclose to userspace using put_user()
On 01/10/2018 06:02 PM, Kees Cook wrote: From: David WindsorThe autoclose field can be copied with put_user(), so there is no need to use copy_to_user(). In both cases, hardened usercopy is being bypassed since the size is constant, and not open to runtime manipulation. This patch is verbatim from Brad Spengler/PaX Team's PAX_USERCOPY whitelisting code in the last public patch of grsecurity/PaX based on my understanding of the code. Changes or omissions from the original code are mine and don't reflect the original grsecurity/PaX code. Just tried a quick rebase and it looks like this conflicts with c76f97c99ae6 ("sctp: make use of pre-calculated len") I don't think we can use put_user if we're copying via the full len? Thanks, Laura Signed-off-by: David Windsor [kees: adjust commit log] Cc: Vlad Yasevich Cc: Neil Horman Cc: "David S. Miller" Cc: linux-s...@vger.kernel.org Cc: netdev@vger.kernel.org Signed-off-by: Kees Cook --- net/sctp/socket.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/sctp/socket.c b/net/sctp/socket.c index efbc8f52c531..15491491ec88 100644 --- a/net/sctp/socket.c +++ b/net/sctp/socket.c @@ -5011,7 +5011,7 @@ static int sctp_getsockopt_autoclose(struct sock *sk, int len, char __user *optv len = sizeof(int); if (put_user(len, optlen)) return -EFAULT; - if (copy_to_user(optval, _sk(sk)->autoclose, sizeof(int))) + if (put_user(sctp_sk(sk)->autoclose, (int __user *)optval)) return -EFAULT; return 0; }
[PATCH net] rds: tcp: compute m_ack_seq as offset from ->write_seq
rds-tcp uses m_ack_seq to track the tcp ack# that indicates that the peer has received a rds_message. The m_ack_seq is used in rds_tcp_is_acked() to figure out when it is safe to drop the rds_message from the RDS retransmit queue. The m_ack_seq must be calculated as an offset from the right edge of the in-flight tcp buffer, i.e., it should be based on the ->write_seq, not the ->snd_nxt. Signed-off-by: Sowmini Varadhan--- net/rds/tcp.c |5 +++-- net/rds/tcp.h |2 +- net/rds/tcp_send.c |4 ++-- 3 files changed, 6 insertions(+), 5 deletions(-) diff --git a/net/rds/tcp.c b/net/rds/tcp.c index 6b7ee71..ab7356e 100644 --- a/net/rds/tcp.c +++ b/net/rds/tcp.c @@ -90,9 +90,10 @@ void rds_tcp_nonagle(struct socket *sock) sizeof(val)); } -u32 rds_tcp_snd_nxt(struct rds_tcp_connection *tc) +u32 rds_tcp_write_seq(struct rds_tcp_connection *tc) { - return tcp_sk(tc->t_sock->sk)->snd_nxt; + /* seq# of the last byte of data in tcp send buffer */ + return tcp_sk(tc->t_sock->sk)->write_seq; } u32 rds_tcp_snd_una(struct rds_tcp_connection *tc) diff --git a/net/rds/tcp.h b/net/rds/tcp.h index 1aafbf7..864ca7d 100644 --- a/net/rds/tcp.h +++ b/net/rds/tcp.h @@ -54,7 +54,7 @@ struct rds_tcp_statistics { void rds_tcp_reset_callbacks(struct socket *sock, struct rds_conn_path *cp); void rds_tcp_restore_callbacks(struct socket *sock, struct rds_tcp_connection *tc); -u32 rds_tcp_snd_nxt(struct rds_tcp_connection *tc); +u32 rds_tcp_write_seq(struct rds_tcp_connection *tc); u32 rds_tcp_snd_una(struct rds_tcp_connection *tc); u64 rds_tcp_map_seq(struct rds_tcp_connection *tc, u32 seq); extern struct rds_transport rds_tcp_transport; diff --git a/net/rds/tcp_send.c b/net/rds/tcp_send.c index dc860d1..9b76e0f 100644 --- a/net/rds/tcp_send.c +++ b/net/rds/tcp_send.c @@ -86,7 +86,7 @@ int rds_tcp_xmit(struct rds_connection *conn, struct rds_message *rm, * m_ack_seq is set to the sequence number of the last byte of * header and data. see rds_tcp_is_acked(). */ - tc->t_last_sent_nxt = rds_tcp_snd_nxt(tc); + tc->t_last_sent_nxt = rds_tcp_write_seq(tc); rm->m_ack_seq = tc->t_last_sent_nxt + sizeof(struct rds_header) + be32_to_cpu(rm->m_inc.i_hdr.h_len) - 1; @@ -98,7 +98,7 @@ int rds_tcp_xmit(struct rds_connection *conn, struct rds_message *rm, rm->m_inc.i_hdr.h_flags |= RDS_FLAG_RETRANSMITTED; rdsdebug("rm %p tcp nxt %u ack_seq %llu\n", -rm, rds_tcp_snd_nxt(tc), +rm, rds_tcp_write_seq(tc), (unsigned long long)rm->m_ack_seq); } -- 1.7.1
Re: [PATCH net] flow_dissector: properly cap thoff field
From: Eric DumazetDate: Wed, 17 Jan 2018 14:21:13 -0800 > From: Eric Dumazet > > syzbot reported yet another crash [1] that is caused by > insufficient validation of DODGY packets. > > Two bugs are happening here to trigger the crash. > > 1) Flow dissection leaves with incorrect thoff field. > > 2) skb_probe_transport_header() sets transport header to this invalid > thoff, even if pointing after skb valid data. > > 3) qdisc_pkt_len_init() reads out-of-bound data because it > trusts tcp_hdrlen(skb) > > Possible fixes : > > - Full flow dissector validation before injecting bad DODGY packets in > the stack. > This approach was attempted here : https://patchwork.ozlabs.org/patch/ > 861874/ > > - Have more robust functions in the core. > This might be needed anyway for stable versions. > > This patch fixes the flow dissection issue. > > [1] ... > Fixes: 34fad54c2537 ("net: __skb_flow_dissect() must cap its return value") > Fixes: a6e544b0a88b ("flow_dissector: Jump to exit code in > __skb_flow_dissect") > Signed-off-by: Eric Dumazet > Cc: Willem de Bruijn > Reported-by: syzbot Applied and queued up for -stable, thanks Eric.
Re: [PATCH-next] flow_netlink: Remove unneeded semicolons
From: Christopher Díaz RiverosDate: Wed, 17 Jan 2018 16:10:28 -0500 > Trivial fix removes unneeded semicolons after if blocks. > > This issue was detected by using the Coccinelle software. > > Signed-off-by: Christopher Díaz Riveros Applied, thanks.
Re: [PATCH net] gso: validate gso_type if SKB_GSO_DODGY
> Aside from inet_gso_segment and ipv6_gso_segment, this only leaves the > new nsh_gso_segment. Unlke mpls, it has its own gso_type, so > > if (!skb_shinfo(skb)->gso_type & SKB_GSO_NSH) > goto out; This last point was incorrect. There is no such type for this protocol, so we have to do the same as for MPLS. Either assume that it is robust or be conservative and add if (skb_shinfo(skb)->gso_type & SKB_GSO_DODGY) goto out; It looks correct on the surface and may already be relied upon, so I suggest the first approach.
Re: [PATCH v2] tools/bpftool: silence a static checker warning
On 01/18/2018 10:35 AM, Dan Carpenter wrote: > There is a static checker warning that "proglen" has an upper bound but > no lower bound. The allocation will just fail harmlessly so it's not a > big deal. > > Signed-off-by: Dan CarpenterApplied to bpf-next, thanks Dan! (I changed the prefix from 'tools/bpftool' into 'tools/bpf_jit_disasm' since it's not touching bpftool here.)
Re: pull-request: wireless-drivers 2018-01-17
From: Kalle ValoDate: Wed, 17 Jan 2018 16:30:21 +0200 > here are few more important fixes to the net tree for 4.15, I hope they > still make it. Please let me know if there are any problems. Pulled, thanks Kalle.
Re: [PATCH] net/mlx5e: Fix trailing semicolon
From: Luis de BethencourtDate: Wed, 17 Jan 2018 12:09:15 + > The trailing semicolon is an empty statement that does no operation. > Removing it since it doesn't do anything. > > Signed-off-by: Luis de Bethencourt Applied to net-next, thank you.
Re: [bpf-next PATCH] bpf: Sync kernel ABI header with tooling header
On 01/18/2018 05:49 PM, Jesper Dangaard Brouer wrote: > Update tools/include/uapi/linux/bpf.h to bring it in sync with > include/uapi/linux/bpf.h. The listed commits forgot to update it. > > Fixes: 02dd3291b2f0 ("bpf: finally expose xdp_rxq_info to XDP bpf-programs") > Fixes: f19397a5c656 ("bpf: Add access to snd_cwnd and others in sock_ops") > Fixes: 06ef0ccb5a36 ("bpf/cgroup: fix a verification error for a > CGROUP_DEVICE type prog") > Signed-off-by: Jesper Dangaard BrouerApplied to bpf-next, thanks Jesper!
Re: [net] fm10k: mark PM functions as __maybe_unused
From: David MillerDate: Thu, 18 Jan 2018 15:52:38 -0500 (EST) > From: Jeff Kirsher > Date: Wed, 17 Jan 2018 07:57:32 -0800 > >> From: Arnd Bergmann >> >> A cleanup of the PM code left an incorrect #ifdef in place, leading >> to a harmless build warning: >> >> drivers/net/ethernet/intel/fm10k/fm10k_pci.c:2502:12: error: 'fm10k_suspend' >> defined but not used [-Werror=unused-function] >> drivers/net/ethernet/intel/fm10k/fm10k_pci.c:2475:12: error: 'fm10k_resume' >> defined but not used [-Werror=unused-function] >> >> It's easier to use __maybe_unused attributes here, since you >> can't pick the wrong one. >> >> Fixes: 8249c47c6ba4 ("fm10k: use generic PM hooks instead of legacy PCIe >> power hooks") >> Signed-off-by: Arnd Bergmann >> Acked-by: Jacob Keller >> Tested-by: Krishneil Singh >> Signed-off-by: Jeff Kirsher > > Applied, thanks Arnd. Of course, thank you to Jeff as well :-)
Re: [PATCH net] gso: validate gso_type if SKB_GSO_DODGY
On Thu, Jan 18, 2018 at 4:14 AM, Jason Wangwrote: > > > On 2018年01月18日 14:04, Willem de Bruijn wrote: >> >> On Thu, Jan 18, 2018 at 12:20 AM, Willem de Bruijn >> wrote: >>> >>> On Wed, Jan 17, 2018 at 10:48 PM, Jason Wang wrote: On 2018年01月18日 07:11, Willem de Bruijn wrote: > > From: Willem de Bruijn > > Validate gso_type of untrusted SKB_GSO_DODGY packets during > segmentation. > > Untrusted user packets are limited to a small set of gso types in > virtio_net_hdr_to_skb. But segmentation occurs on packet contents. > Syzkaller was able to enter gso callbacks that are not hardened > against untrusted user input. > > Fixes: f43798c27684 ("tun: Allow GSO using virtio_net_hdr") This commit is suspicious, I guess it should be 5c7cdf339af5 ("gso: Remove arbitrary checks for unsupported GSO") >>> >>> The specific SCTP path was introduced with commit 90017accff61 ("sctp: >>> add GSO support"). But the main issue that packets can be delivered to >>> gso handlers different from their gso_type goes back further. >>> >>> The commit you reference is actually older than the sctp gso patch, so >>> it makes sense that it did not have a check in the sctp_gso_segment. >>> >>> I still think that we should check in inet_gso_segment when we have >>> the proto, instead of in each {tcp, sctp, udp, esp, ...} handler having >>> a check of the form. >>> >>>!(type & (SKB_GSO_TCPV4 | >>> SKB_GSO_TCPV6 >> >> Unless we can create packets that legitimate combine >> SKB_GSO_DODGY with tunnel headers. > > > As you mentioned below, looks like we can e.g bridge between tunnels (vxlan, > gre or others) and tap, or even bpf can produce this (e.g > bpf_skb_adjust_room). > >> >> virtio_net_hdr_to_skb does not accept tunneled gso types. > > > Yes, Vlad is trying to extend virtio to support more kinds of gso types, so > it will be supported for sure. > >> >> But a tun device can be bridged with a gre tunnel in the >> host, creating a path that will call gre_gso_segment. >> >> If that is possible, then this patch is indeed too strict and >> we do need checks in the individual handlers. Okay, I'm working on a patch that adds explicit checks @@ -45,6 +45,9 @@ static struct sk_buff *sctp_gso_segment(struct sk_buff *skb, struct sk_buff *segs = ERR_PTR(-EINVAL); struct sctphdr *sh; + if (!skb_shinfo(skb)->gso_type & SKB_GSO_SCTP) + goto out; + to all transport layer gso offloads: {sctp, tcpv[46], ufov[46], espv[46]}. This will block packets with gso_type X but protocol type Y from being parsed. But does allow entering a tunnel protocol handler if that is different from Y, unlike the above patch. tunnels segmentation itself is protected by skb->encapsulation. Only tunnel devices in the stack can set this field, not virtio_net_hdr_to_skb. Packets that enter {gre, udp tunnel, ipxip4, ipxip6} without this bit are already dropped, so no new checks are needed to these based on gso_type. That leaves eth_proto callbacks. mpls had this check before commit 5c7cdf339af5 ("gso: Remove arbitrary checks for unsupported GSO"): if (unlikely(skb_shinfo(skb)->gso_type & ~(SKB_GSO_TCPV4 | SKB_GSO_TCPV6 | SKB_GSO_UDP | SKB_GSO_DODGY | SKB_GSO_TCP_FIXEDID | SKB_GSO_TCP_ECN))) goto out; which appears to exclude only other tunnel types at the time, no transport layers. I don't think we need any new check here, then: it was robust against handling dodgy sources. Aside from inet_gso_segment and ipv6_gso_segment, this only leaves the new nsh_gso_segment. Unlke mpls, it has its own gso_type, so if (!skb_shinfo(skb)->gso_type & SKB_GSO_NSH) goto out;
[PATCH] net: tcp: close sock if net namespace is exiting
When a tcp socket is closed, if it detects that its net namespace is exiting, close immediately and do not wait for FIN sequence. For normal sockets, a reference is taken to their net namespace, so it will never exit while the socket is open. However, kernel sockets do not take a reference to their net namespace, so it may begin exiting while the kernel socket is still open. In this case if the kernel socket is a tcp socket, it will stay open trying to complete its close sequence. The sock's dst(s) hold a reference to their interface, which are all transferred to the namespace's loopback interface when the real interfaces are taken down. When the namespace tries to take down its loopback interface, it hangs waiting for all references to the loopback interface to release, which results in messages like: unregister_netdevice: waiting for lo to become free. Usage count = 1 These messages continue until the socket finally times out and closes. Since the net namespace cleanup holds the net_mutex while calling its registered pernet callbacks, any new net namespace initialization is blocked until the current net namespace finishes exiting. After this change, the tcp socket notices the exiting net namespace, and closes immediately, releasing its dst(s) and their reference to the loopback interface, which lets the net namespace continue exiting. Link: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1711407 Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=97811 Signed-off-by: Dan Streetman--- include/net/net_namespace.h | 10 ++ net/ipv4/tcp.c | 3 +++ net/ipv4/tcp_timer.c| 15 +++ 3 files changed, 28 insertions(+) diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h index f8a84a2c2341..f306b2aa15a4 100644 --- a/include/net/net_namespace.h +++ b/include/net/net_namespace.h @@ -223,6 +223,11 @@ int net_eq(const struct net *net1, const struct net *net2) return net1 == net2; } +static inline int check_net(const struct net *net) +{ + return refcount_read(>count) != 0; +} + void net_drop_ns(void *); #else @@ -247,6 +252,11 @@ int net_eq(const struct net *net1, const struct net *net2) return 1; } +static inline int check_net(const struct net *net) +{ + return 1; +} + #define net_drop_ns NULL #endif diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index d7cf861bf699..9389193e73f3 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -2298,6 +2298,9 @@ void tcp_close(struct sock *sk, long timeout) tcp_send_active_reset(sk, GFP_ATOMIC); __NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPABORTONMEMORY); + } else if (!check_net(sock_net(sk))) { + /* Not possible to send reset; just close */ + tcp_set_state(sk, TCP_CLOSE); } } diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c index 6db3124cdbda..41b40b805aa3 100644 --- a/net/ipv4/tcp_timer.c +++ b/net/ipv4/tcp_timer.c @@ -48,11 +48,19 @@ static void tcp_write_err(struct sock *sk) * to prevent DoS attacks. It is called when a retransmission timeout * or zero probe timeout occurs on orphaned socket. * + * Also close if our net namespace is exiting; in that case there is no + * hope of ever communicating again since all netns interfaces are already + * down (or about to be down), and we need to release our dst references, + * which have been moved to the netns loopback interface, so the namespace + * can finish exiting. This condition is only possible if we are a kernel + * socket, as those do not hold references to the namespace. + * * Criteria is still not confirmed experimentally and may change. * We kill the socket, if: * 1. If number of orphaned sockets exceeds an administratively configured * limit. * 2. If we have strong memory pressure. + * 3. If our net namespace is exiting. */ static int tcp_out_of_resources(struct sock *sk, bool do_reset) { @@ -81,6 +89,13 @@ static int tcp_out_of_resources(struct sock *sk, bool do_reset) __NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPABORTONMEMORY); return 1; } + + if (!check_net(sock_net(sk))) { + /* Not possible to send reset; just close */ + tcp_done(sk); + return 1; + } + return 0; } -- 2.14.1
Re: [bpf-next PATCH] bpf: add comments to BPF ld/ldx sizes
On 01/17/2018 12:05 PM, Jesper Dangaard Brouer wrote: > Doc BPF ld/ldx size defines as comments in code, as it makes in > faster to lookup in a programming/review setting, than looking up > the sizes in Documentation/networking/filter.txt. > > Signed-off-by: Jesper Dangaard BrouerApplied to bpf-next, thanks Jesper!
Re: [PATCH bpf-next] bpf: change fake_ip for bpf_trace_printk helper
On 01/17/2018 06:19 PM, Yonghong Song wrote: > Currently, for bpf_trace_printk helper, fake ip address 0x1 > is used with comments saying that fake ip will not be printed. > This is indeed true for 4.12 and earlier version, but for > 4.13 and later version, the ip address will be printed if > it cannot be resolved with kallsym. Running samples/bpf/tracex5 > program and you will have the following in the debugfs > trace_pipe output: > ... > <...>-1819 [003] 443.497877: 0x0001: mmap > <...>-1819 [003] 443.498289: 0x0001: syscall=102 (one of > get/set uid/pid/gid) > ... > > The kernel commit changed this behavior is: > commit feaf1283d11794b9d518fcfd54b6bf8bee1f0b4b > Author: Steven Rostedt (VMware)> Date: Thu Jun 22 17:04:55 2017 -0400 > > tracing: Show address when function names are not found > ... > > This patch changed the comment and also altered the fake ip > address to 0x0 as users may think 0x1 has some special meaning > while it doesn't. The new output: > ... > <...>-1799 [002] 25.953576: 0: mmap > <...>-1799 [002] 25.953865: 0: read(fd=0, buf=053936b5, > size=512) > ... > > Signed-off-by: Yonghong Song Applied to bpf-next, thanks Yonghong!
Re: [PATCH net-next 1/1] rtnetlink: request RTM_GETLINK by pid or fd
On Thu, Jan 18, 2018 at 09:29:14PM +0100, Jiri Benc wrote: > On Thu, 18 Jan 2018 21:21:24 +0100, Christian Brauner wrote: > > In such scenarios setting a netns id property is > > not really wanted > > Why? I think that's what you should do if you want to avoid setns. Just > use netnsid. I don't see any problem with that and you didn't provide > any explanation. Ah, sorry, maybe I was to brief. When creating and destroying a lot of short-lived network namespaces it seems unnecessary to give them all label/netns id. Using a netns id makes much more sense when you want a persistent, long-living network namespace. For example, iproute2 where you want to create a persistent network namespace that sticks around via ip netns add bla && ip netns set bla 5. A more concrete scenario is creating a network namespace, moving a device into it via RTM_SETLINK which also supports IFLA_NET_NS_{FD,PID} and then wanting to query the device. This would be very easy to do if one could reuse the IFLA_NET_NS_{FD,PID} without having to set a netnsid. Christian
Re: [net] fm10k: mark PM functions as __maybe_unused
From: Jeff KirsherDate: Wed, 17 Jan 2018 07:57:32 -0800 > From: Arnd Bergmann > > A cleanup of the PM code left an incorrect #ifdef in place, leading > to a harmless build warning: > > drivers/net/ethernet/intel/fm10k/fm10k_pci.c:2502:12: error: 'fm10k_suspend' > defined but not used [-Werror=unused-function] > drivers/net/ethernet/intel/fm10k/fm10k_pci.c:2475:12: error: 'fm10k_resume' > defined but not used [-Werror=unused-function] > > It's easier to use __maybe_unused attributes here, since you > can't pick the wrong one. > > Fixes: 8249c47c6ba4 ("fm10k: use generic PM hooks instead of legacy PCIe > power hooks") > Signed-off-by: Arnd Bergmann > Acked-by: Jacob Keller > Tested-by: Krishneil Singh > Signed-off-by: Jeff Kirsher Applied, thanks Arnd.
Re: [PATCHv3 net-next 2/2] openvswitch: add erspan version I and II support
Hi Jiri, On Thu, Jan 18, 2018 at 2:13 AM, Jiri Bencwrote: > Looks much better. > > On Wed, 17 Jan 2018 09:32:51 -0800, William Tu wrote: >> + OVS_ERSPAN_OPT_IDX, /* be32 index */ > > Why don't you convert this to u32 while passing to/from user space? > >> + [OVS_ERSPAN_OPT_IDX]= { .len = sizeof(u32) }, > > sizeof(__be32) but see above. > > Thanks, > > Jiri Thanks. Your suggestion makes sense. In the beginning I just want to avoid another ntoh, hton conversion. Since the ERSPAN iproute2 also assume u32 to/from userspace, I will change it here to use u32. William