Re: DPAA Ethernet problems with mainstream Linux kernels
Hi Jamie,

Many thanks for your effort. If you need a new kernel for testing, please let me know.

Cheers,
Christian

Sent from my iPhone

> On 18. Jan 2018, at 02:59, Jamie Krueger wrote:
>
> Hi Madalin,
>
> On 01/16/2018 11:33 AM, Madalin-cristian Bucur wrote:
>
> Here are some results from testing on the X5000/20 with 4.15.0-rc8
> (w/CONFIG_FSL_PAMU disabled):
>
> In my tests I can now get the interface to obtain an IP address via DHCP,
> at least *part* of the time.
>
> Here is a link to a screenshot of a Wireshark capture which shows
> a successful DHCP request/response for the X5000/20, followed by some
> failed ping attempts.
> (This traffic was captured on an external machine on the same subnet, not
> on the X5000/20 itself.)
> http://bitbybitsoftwaregroup.com/share/X5000-DHCP-Answered-Wireshark.png
>
> Additionally, here is a link to the export of the packet data shown in the
> above image:
> http://bitbybitsoftwaregroup.com/share/Wireshark-X5000-DHCP-Success.html
>
> As you can see from this example, even when the interface does obtain an
> IP address via DHCP (about half the time), subsequent traffic still fails
> to receive responses.
>
> In this example, I attempt to ping the gateway (192.168.1.1) and one or
> two other local addresses, and nothing pings back (even when trying
> Skateman's suggestion of expanding the buffers for the interface:
>
> "A workaround to keep the nic alive a bit longer is the following command
> ip link set eth0 qlen 1")
>
> For reference:
>
> I have attached the kernel config that the tested version was built with.
> Also, I have attached the dmesg output I am currently seeing.
>
> --
> Best Regards,
>
> Jamie Krueger
> BITbyBIT Software Group LLC
Re: [RFC PATCH net-next 02/12] selftests: forwarding: Add a test for FDB learning
Thu, Jan 18, 2018 at 08:16:54AM CET, ido...@idosch.org wrote:
> On Wed, Jan 17, 2018 at 04:15:04PM -0800, David Ahern wrote:
>> How about the attached - make ping, ping6 and mz variables? Default to
>> ping and mausezahn but let users update that name if necessary. Comments
>> in the sample config to the effect need to be added to help users.
>
> Yes, looks good to me. Will include the change in the next version.

Ido, please keep in mind that the script should run without the config
file, so the defaults should be in lib.sh.

Thanks!
Re: [RFC PATCH] e1000e: Remove Other from EIAC.
On 2018/01/18 15:50, Benjamin Poirier wrote:
> It was reported that emulated e1000e devices in vmware esxi 6.5 Build
> 7526125 do not link up after commit 4aea7a5c5e94 ("e1000e: Avoid receiver
> overrun interrupt bursts", v4.15-rc1). Some tracing shows that after
> e1000e_trigger_lsc() is called, ICR reads out as 0x0 in e1000_msix_other()
> on emulated e1000e devices. In comparison, on real e1000e 82574 hardware,
> icr=0x8004 (_INT_ASSERTED | _OTHER) in the same situation.

Nit: that value decodes as (_INT_ASSERTED | _LSC), not (_INT_ASSERTED | _OTHER).

> Some experimentation showed that this flaw in vmware e1000e emulation can
> be worked around by not setting Other in EIAC. This is how it was before
> 16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt", v4.5-rc1).
>
> Fixes: 4aea7a5c5e94 ("e1000e: Avoid receiver overrun interrupt bursts")
> Signed-off-by: Benjamin Poirier
Re: [RFC PATCH net-next 00/12] selftests: forwarding: Add VRF-based tests
Hi Andrew, Florian,

On Thu, Jan 18, 2018 at 12:11:11AM +0100, Andrew Lunn wrote:
>>> However, a similar kind of flexibility can be achieved by using VRFs and
>>> by looping the switch ports together. For example:
>>>
>>>                              br0
>>>                               +
>>>            vrf-h1            |             vrf-h2
>>>               +       +------+------+        +
>>>               |       |             |        |
>>>  192.0.2.1/24 +       +             +        + 192.0.2.2/24
>>>             swp1    swp2          swp3     swp4
>>>               +       +             +        +
>>>               |       |             |        |
>>>               +-------+             +--------+
>>
>> Agreed, this is really cool! For DSA-enabled switches, we usually have a
>> host that does the test sequencing and then execute commands remotely on
>> the DUT, but we might be able to get such a similar framework up and
>> running on the DUT itself without too much hassle.
>
> I think the problem we will have is a lack of ports. Most DSA switches
> have 4 or 5 ports. Given the need for two ports per bridge port, we
> will be limited to bridges with just two members. That really limits
> what sort of tests you can do.

I was actually interested in feedback from you guys. Looking at
dsa_slave_changeupper() I see you don't forbid enslavement to a VRF and
that you set the STP state to forwarding when a port leaves a bridge
(good). Does that mean you're able to use some of these tests on your
switches?

The reason we can use these tests for mlxsw is that we support VRF and
ACL offload. At least in the above example, swp4 is able to receive
packets directed at 192.0.2.2 because we program the device with the
host route 192.0.2.2/32.
Re: [RFC PATCH] e1000e: Remove Other from EIAC.
On 2018/01/18 15:50, Benjamin Poirier wrote:
> It was reported that emulated e1000e devices in vmware esxi 6.5 Build
> 7526125 do not link up after commit 4aea7a5c5e94 ("e1000e: Avoid receiver
> overrun interrupt bursts", v4.15-rc1). Some tracing shows that after
> e1000e_trigger_lsc() is called, ICR reads out as 0x0 in e1000_msix_other()
> on emulated e1000e devices. In comparison, on real e1000e 82574 hardware,
> icr=0x8004 (_INT_ASSERTED | _OTHER) in the same situation.
>
> Some experimentation showed that this flaw in vmware e1000e emulation can
> be worked around by not setting Other in EIAC. This is how it was before
> 16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt", v4.5-rc1).

vmware folks, please comment.
Re: AIO operation and CMSG
On Wednesday, 17 January 2018, 20:22:13 CET, Christoph Hellwig wrote:

Hi Christoph,

> On Sun, Jan 14, 2018 at 09:01:00AM +0100, Stephan Müller wrote:
>> The syscall io_submit sends data to the kernel and invokes the respective
>> handler function in the kernel, such as the recvmsg handler. What I am
>> wondering is whether there is a way to send CMSG data along with the
>> io_submit syscall? If not, is CMSG handling with the AIO syscalls
>> possible at all?
>
> Not as-is, but it could be easily added by repurposing unused fields
> in the iocb. The big question is if it should be done to the existing
> IOCB_CMD_PREADV/IOCB_CMD_PWRITEV types, or if new SENDMSG/RECVMSG ones
> should be added instead.

Thank you for the clarification. For the moment we have found another
solution, which we are discussing on linux-crypto. Therefore, I currently
do not see the need for additional CMSG support in AIO.

Ciao
Stephan
Re: [RFC PATCH net-next 02/12] selftests: forwarding: Add a test for FDB learning
On Wed, Jan 17, 2018 at 04:15:04PM -0800, David Ahern wrote:
> How about the attached - make ping, ping6 and mz variables? Default to
> ping and mausezahn but let users update that name if necessary. Comments
> in the sample config to the effect need to be added to help users.

Yes, looks good to me. Will include the change in the next version.

Thanks!
[RFC PATCH] e1000e: Remove Other from EIAC.
It was reported that emulated e1000e devices in vmware esxi 6.5 Build
7526125 do not link up after commit 4aea7a5c5e94 ("e1000e: Avoid receiver
overrun interrupt bursts", v4.15-rc1). Some tracing shows that after
e1000e_trigger_lsc() is called, ICR reads out as 0x0 in e1000_msix_other()
on emulated e1000e devices. In comparison, on real e1000e 82574 hardware,
icr=0x8004 (_INT_ASSERTED | _OTHER) in the same situation.

Some experimentation showed that this flaw in vmware e1000e emulation can
be worked around by not setting Other in EIAC. This is how it was before
16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt", v4.5-rc1).

Fixes: 4aea7a5c5e94 ("e1000e: Avoid receiver overrun interrupt bursts")
Signed-off-by: Benjamin Poirier
---
Jeff, I'm sending as RFC since it looks like a problem that should be
fixed in vmware. If you'd like to have the workaround in e1000e, I'll
submit.
---
 drivers/net/ethernet/intel/e1000e/netdev.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index 9f18d39bdc8f..625a4c9a86a4 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -1918,6 +1918,8 @@ static irqreturn_t e1000_msix_other(int __always_unused irq, void *data)
 	bool enable = true;
 
 	icr = er32(ICR);
+	ew32(ICR, E1000_ICR_OTHER);
+
 	if (icr & E1000_ICR_RXO) {
 		ew32(ICR, E1000_ICR_RXO);
 		enable = false;
@@ -2040,7 +2042,6 @@ static void e1000_configure_msix(struct e1000_adapter *adapter)
 		       hw->hw_addr + E1000_EITR_82574(vector));
 	else
 		writel(1, hw->hw_addr + E1000_EITR_82574(vector));
-	adapter->eiac_mask |= E1000_IMS_OTHER;
 
 	/* Cause Tx interrupts on every write back */
 	ivar |= BIT(31);
@@ -2265,7 +2266,7 @@ static void e1000_irq_enable(struct e1000_adapter *adapter)
 
 	if (adapter->msix_entries) {
 		ew32(EIAC_82574, adapter->eiac_mask & E1000_EIAC_MASK_82574);
-		ew32(IMS, adapter->eiac_mask | E1000_IMS_LSC);
+		ew32(IMS, adapter->eiac_mask | E1000_IMS_OTHER | E1000_IMS_LSC);
 	} else if (hw->mac.type >= e1000_pch_lpt) {
 		ew32(IMS, IMS_ENABLE_MASK | E1000_IMS_ECCER);
 	} else {
-- 
2.15.1
[PATCH net] netlink: reset extack earlier in netlink_rcv_skb
Move up the extack reset/initialization in netlink_rcv_skb, so that the
'goto ack' paths will not skip it. Otherwise, netlink_ack may later use
the uninitialized extack and cause a kernel crash.

Fixes: cbbdf8433a5f ("netlink: extack needs to be reset each time through loop")
Reported-by: syzbot+03bee3680a3746677...@syzkaller.appspotmail.com
Signed-off-by: Xin Long
---
 net/netlink/af_netlink.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 47ef2d8..84a4e4c 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2391,6 +2391,7 @@ int netlink_rcv_skb(struct sk_buff *skb, int (*cb)(struct sk_buff *,
 	while (skb->len >= nlmsg_total_size(0)) {
 		int msglen;
 
+		memset(&extack, 0, sizeof(extack));
 		nlh = nlmsg_hdr(skb);
 		err = 0;
 
@@ -2405,7 +2406,6 @@ int netlink_rcv_skb(struct sk_buff *skb, int (*cb)(struct sk_buff *,
 		if (nlh->nlmsg_type < NLMSG_MIN_TYPE)
 			goto ack;
 
-		memset(&extack, 0, sizeof(extack));
 		err = cb(skb, nlh, &extack);
 		if (err == -EINTR)
 			goto skip;
-- 
2.1.0
[PATCH net-next 0/5] add some features to hns3 driver
This patchset adds some features to the hns3 driver, including support for
the ethtool commands -d and -p, and support for the manager table.

[Patch 1/5] adds support for ethtool -d; its ops is get_regs. The driver
sends a command to the command queue, and gets the register number and
register values back from the command queue.

[Patch 2/5] adds manager table initialization for the hardware.

[Patches 3/5 - 4/5] add support for ethtool -p. The PHY controls the LEDs
for copper ports. For fiber ports, the driver sends a command to the
command queue, and the IMP writes SGPIO registers to control the LEDs.

[Patch 5/5] adds net-status LED support for fiber ports. Net status
includes port speed, total rx/tx packets, and link status. The driver
sends the status to the command queue, and the IMP writes SGPIO registers
to control the LEDs.

Fuyun Liang (2):
  net: hns3: add support for get_regs
  net: hns3: add manager table initialization for hardware

Jian Shen (3):
  net: hns3: add ethtool -p support for phy device
  net: hns3: add ethtool -p support for fiber port
  net: hns3: add net status led support for fiber port

 drivers/net/ethernet/hisilicon/hns3/hnae3.h        |   5 +-
 drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c |  35 ++
 .../net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h |  47 ++
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c    | 533 ++++++++++++++++++++
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.h    |  14 +
 5 files changed, 633 insertions(+), 1 deletion(-)
-- 
2.9.3
[PATCH net-next 5/5] net: hns3: add net status led support for fiber port
From: Jian Shen Check the net status per second, include port speed, total rx/tx packets and link status. Updating the led status for fiber port. Signed-off-by: Jian Shen Signed-off-by: Peng Li --- .../net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h | 1 + .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c| 109 + .../ethernet/hisilicon/hns3/hns3pf/hclge_main.h| 2 + 3 files changed, 112 insertions(+) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h index 122f862..3fd10a6 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h @@ -115,6 +115,7 @@ enum hclge_opcode_type { HCLGE_OPC_QUERY_LINK_STATUS = 0x0307, HCLGE_OPC_CONFIG_MAX_FRM_SIZE = 0x0308, HCLGE_OPC_CONFIG_SPEED_DUP = 0x0309, + HCLGE_OPC_STATS_MAC_TRAFFIC = 0x0314, /* MACSEC command */ /* PFC/Pause CMD*/ diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c index 039c90b..f3d9332 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c @@ -39,6 +39,7 @@ static int hclge_set_mta_filter_mode(struct hclge_dev *hdev, static int hclge_set_mtu(struct hnae3_handle *handle, int new_mtu); static int hclge_init_vlan_config(struct hclge_dev *hdev); static int hclge_reset_ae_dev(struct hnae3_ae_dev *ae_dev); +static int hclge_update_led_status(struct hclge_dev *hdev); static struct hnae3_ae_algo ae_algo; @@ -505,6 +506,38 @@ static int hclge_32_bit_update_stats(struct hclge_dev *hdev) return 0; } +static int hclge_mac_get_traffic_stats(struct hclge_dev *hdev) +{ + struct hclge_mac_stats *mac_stats = &hdev->hw_stats.mac_stats; + struct hclge_desc desc; + __le64 *desc_data; + int ret; + + /* for fiber port, need to query the total rx/tx packets statstics, +* used for data transferring checking. 
+*/ + if (hdev->hw.mac.media_type != HNAE3_MEDIA_TYPE_FIBER) + return 0; + + if (test_bit(HCLGE_STATE_STATISTICS_UPDATING, &hdev->state)) + return 0; + + hclge_cmd_setup_basic_desc(&desc, HCLGE_OPC_STATS_MAC_TRAFFIC, true); + ret = hclge_cmd_send(&hdev->hw, &desc, 1); + if (ret) { + dev_err(&hdev->pdev->dev, + "Get MAC total pkt stats fail, ret = %d\n", ret); + + return ret; + } + + desc_data = (__le64 *)(&desc.data[0]); + mac_stats->mac_tx_total_pkt_num += le64_to_cpu(*desc_data++); + mac_stats->mac_rx_total_pkt_num += le64_to_cpu(*desc_data); + + return 0; +} + static int hclge_mac_update_stats(struct hclge_dev *hdev) { #define HCLGE_MAC_CMD_NUM 21 @@ -2846,13 +2879,20 @@ static void hclge_service_task(struct work_struct *work) struct hclge_dev *hdev = container_of(work, struct hclge_dev, service_task); + /* The total rx/tx packets statstics are wanted to be updated +* per second. Both hclge_update_stats_for_all() and +* hclge_mac_get_traffic_stats() can do it. +*/ if (hdev->hw_stats.stats_timer >= HCLGE_STATS_TIMER_INTERVAL) { hclge_update_stats_for_all(hdev); hdev->hw_stats.stats_timer = 0; + } else { + hclge_mac_get_traffic_stats(hdev); } hclge_update_speed_duplex(hdev); hclge_update_link_status(hdev); + hclge_update_led_status(hdev); hclge_service_complete(hdev); } @@ -5965,6 +6005,75 @@ static int hclge_set_led_id(struct hnae3_handle *handle, return ret; } +enum hclge_led_port_speed { + HCLGE_SPEED_LED_FOR_1G, + HCLGE_SPEED_LED_FOR_10G, + HCLGE_SPEED_LED_FOR_25G, + HCLGE_SPEED_LED_FOR_40G, + HCLGE_SPEED_LED_FOR_50G, + HCLGE_SPEED_LED_FOR_100G, +}; + +static u8 hclge_led_get_speed_status(u32 speed) +{ + u8 speed_led; + + switch (speed) { + case HCLGE_MAC_SPEED_1G: + speed_led = HCLGE_SPEED_LED_FOR_1G; + break; + case HCLGE_MAC_SPEED_10G: + speed_led = HCLGE_SPEED_LED_FOR_10G; + break; + case HCLGE_MAC_SPEED_25G: + speed_led = HCLGE_SPEED_LED_FOR_25G; + break; + case HCLGE_MAC_SPEED_40G: + speed_led = HCLGE_SPEED_LED_FOR_40G; + break; + case 
HCLGE_MAC_SPEED_50G: + speed_led = HCLGE_SPEED_LED_FOR_50G; + break; + case HCLGE_MAC_SPEED_100G: + speed_led = HCLGE_SPEED_LED_FOR_100G; + break; + default: + speed_led = HCLGE_LED_NO_CHANGE; + } + + return speed_led; +} + +static int hclge_update_led_status(struct hclge_dev *hdev) +{ +
[PATCH net-next 1/5] net: hns3: add support for get_regs
From: Fuyun Liang This patch adds get_regs support for ethtool cmd. Signed-off-by: Fuyun Liang Signed-off-by: Peng Li --- drivers/net/ethernet/hisilicon/hns3/hnae3.h| 3 +- drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c | 23 +++ .../net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h | 4 + .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c| 176 + 4 files changed, 205 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/hisilicon/hns3/hnae3.h b/drivers/net/ethernet/hisilicon/hns3/hnae3.h index 634e932..d104ce5 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hnae3.h +++ b/drivers/net/ethernet/hisilicon/hns3/hnae3.h @@ -356,7 +356,8 @@ struct hnae3_ae_ops { u32 stringset, u8 *data); int (*get_sset_count)(struct hnae3_handle *handle, int stringset); - void (*get_regs)(struct hnae3_handle *handle, void *data); + void (*get_regs)(struct hnae3_handle *handle, u32 *version, +void *data); int (*get_regs_len)(struct hnae3_handle *handle); u32 (*get_rss_key_size)(struct hnae3_handle *handle); diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c b/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c index 358f780..1c8b293 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c @@ -1063,6 +1063,27 @@ static int hns3_set_coalesce(struct net_device *netdev, return 0; } +static int hns3_get_regs_len(struct net_device *netdev) +{ + struct hnae3_handle *h = hns3_get_handle(netdev); + + if (!h->ae_algo->ops->get_regs_len) + return -EOPNOTSUPP; + + return h->ae_algo->ops->get_regs_len(h); +} + +static void hns3_get_regs(struct net_device *netdev, + struct ethtool_regs *cmd, void *data) +{ + struct hnae3_handle *h = hns3_get_handle(netdev); + + if (!h->ae_algo->ops->get_regs) + return; + + h->ae_algo->ops->get_regs(h, &cmd->version, data); +} + static const struct ethtool_ops hns3vf_ethtool_ops = { .get_drvinfo = hns3_get_drvinfo, .get_ringparam = hns3_get_ringparam, @@ -1103,6 +1124,8 @@ static const 
struct ethtool_ops hns3_ethtool_ops = { .set_channels = hns3_set_channels, .get_coalesce = hns3_get_coalesce, .set_coalesce = hns3_set_coalesce, + .get_regs_len = hns3_get_regs_len, + .get_regs = hns3_get_regs, }; void hns3_ethtool_set_ops(struct net_device *netdev) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h index 3c3159b..2561e7a 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h @@ -102,6 +102,10 @@ enum hclge_opcode_type { HCLGE_OPC_STATS_64_BIT = 0x0030, HCLGE_OPC_STATS_32_BIT = 0x0031, HCLGE_OPC_STATS_MAC = 0x0032, + + HCLGE_OPC_QUERY_REG_NUM = 0x0040, + HCLGE_OPC_QUERY_32_BIT_REG = 0x0041, + HCLGE_OPC_QUERY_64_BIT_REG = 0x0042, /* Device management command */ /* MAC commond */ diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c index 27f0ab6..c3d2cca 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c @@ -5544,6 +5544,180 @@ static int hclge_set_channels(struct hnae3_handle *handle, u32 new_tqps_num) return ret; } +static int hclge_get_regs_num(struct hclge_dev *hdev, u32 *regs_num_32_bit, + u32 *regs_num_64_bit) +{ + struct hclge_desc desc; + u32 total_num; + int ret; + + hclge_cmd_setup_basic_desc(&desc, HCLGE_OPC_QUERY_REG_NUM, true); + ret = hclge_cmd_send(&hdev->hw, &desc, 1); + if (ret) { + dev_err(&hdev->pdev->dev, + "Query register number cmd failed, ret = %d.\n", ret); + return ret; + } + + *regs_num_32_bit = le32_to_cpu(desc.data[0]); + *regs_num_64_bit = le32_to_cpu(desc.data[1]); + + total_num = *regs_num_32_bit + *regs_num_64_bit; + if (!total_num) + return -EINVAL; + + return 0; +} + +static int hclge_get_32_bit_regs(struct hclge_dev *hdev, u32 regs_num, +void *data) +{ +#define HCLGE_32_BIT_REG_RTN_DATANUM 8 + + struct hclge_desc *desc; 
+ u32 *reg_val = data; + __le32 *desc_data; + int cmd_num; + int i, k, n; + int ret; + + if (regs_num == 0) + return 0; + + cmd_num = DIV_ROUND_UP(regs_num + 2, HCLGE_32_BIT_REG_RTN_DATANUM); + desc = kcalloc(cmd_num, sizeof(struct hclge_desc), GFP_KERNEL); + if (!desc) + return -ENOMEM; + + hc
[PATCH net-next 3/5] net: hns3: add ethtool -p support for phy device
From: Jian Shen Add led location support for phy device. The led will keep blinking with frequency 2HZ when locating. Signed-off-by: Jian Shen Signed-off-by: Peng Li --- drivers/net/ethernet/hisilicon/hns3/hnae3.h| 2 + drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c | 12 +++ .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c| 88 ++ .../ethernet/hisilicon/hns3/hns3pf/hclge_main.h| 12 +++ 4 files changed, 114 insertions(+) diff --git a/drivers/net/ethernet/hisilicon/hns3/hnae3.h b/drivers/net/ethernet/hisilicon/hns3/hnae3.h index d104ce5..fd06bc7 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hnae3.h +++ b/drivers/net/ethernet/hisilicon/hns3/hnae3.h @@ -405,6 +405,8 @@ struct hnae3_ae_ops { int (*set_channels)(struct hnae3_handle *handle, u32 new_tqps_num); void (*get_flowctrl_adv)(struct hnae3_handle *handle, u32 *flowctrl_adv); + int (*set_led_id)(struct hnae3_handle *handle, + enum ethtool_phys_id_state status); }; struct hnae3_dcb_ops { diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c b/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c index 1c8b293..7410205 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c @@ -1084,6 +1084,17 @@ static void hns3_get_regs(struct net_device *netdev, h->ae_algo->ops->get_regs(h, &cmd->version, data); } +static int hns3_set_phys_id(struct net_device *netdev, + enum ethtool_phys_id_state state) +{ + struct hnae3_handle *h = hns3_get_handle(netdev); + + if (!h->ae_algo || !h->ae_algo->ops || !h->ae_algo->ops->set_led_id) + return -EOPNOTSUPP; + + return h->ae_algo->ops->set_led_id(h, state); +} + static const struct ethtool_ops hns3vf_ethtool_ops = { .get_drvinfo = hns3_get_drvinfo, .get_ringparam = hns3_get_ringparam, @@ -1126,6 +1137,7 @@ static const struct ethtool_ops hns3_ethtool_ops = { .set_coalesce = hns3_set_coalesce, .get_regs_len = hns3_get_regs_len, .get_regs = hns3_get_regs, + .set_phys_id = hns3_set_phys_id, }; void 
hns3_ethtool_set_ops(struct net_device *netdev) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c index 6e64bed..73caf06 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c @@ -5819,6 +5819,93 @@ static void hclge_get_regs(struct hnae3_handle *handle, u32 *version, "Get 64 bit register failed, ret = %d.\n", ret); } +static int hclge_set_led_status_phy(struct phy_device *phydev, int value) +{ + int ret, cur_page; + + mutex_lock(&phydev->lock); + + ret = phy_read(phydev, HCLGE_PHY_PAGE_REG); + if (ret < 0) + goto out; + else + cur_page = ret; + + ret = phy_write(phydev, HCLGE_PHY_PAGE_REG, HCLGE_PHY_PAGE_LED); + if (ret) + goto out; + + ret = phy_write(phydev, HCLGE_LED_FC_REG, value); + if (ret) + goto out; + + ret = phy_write(phydev, HCLGE_PHY_PAGE_REG, cur_page); + +out: + mutex_unlock(&phydev->lock); + return ret; +} + +static int hclge_get_led_status_phy(struct phy_device *phydev, int *value) +{ + int ret, cur_page; + + mutex_lock(&phydev->lock); + + ret = phy_read(phydev, HCLGE_PHY_PAGE_REG); + if (ret < 0) + goto out; + else + cur_page = ret; + + ret = phy_write(phydev, HCLGE_PHY_PAGE_REG, HCLGE_PHY_PAGE_LED); + if (ret) + goto out; + + *value = phy_read(phydev, HCLGE_LED_FC_REG); + + ret = phy_write(phydev, HCLGE_PHY_PAGE_REG, cur_page); + +out: + mutex_unlock(&phydev->lock); + return ret; +} + +static int hclge_set_led_id(struct hnae3_handle *handle, + enum ethtool_phys_id_state status) +{ +#define BLINK_FREQUENCY2 + struct hclge_vport *vport = hclge_get_vport(handle); + struct hclge_dev *hdev = vport->back; + struct phy_device *phydev = hdev->hw.mac.phydev; + int ret = 0; + + if (!phydev) + return -EOPNOTSUPP; + + switch (status) { + case ETHTOOL_ID_ACTIVE: + ret = hclge_get_led_status_phy(phydev, &hdev->phy_led_val); + if (ret) + return ret; + return BLINK_FREQUENCY; + case ETHTOOL_ID_ON: + ret = 
hclge_set_led_status_phy(phydev, HCLGE_LED_FORCE_ON); + break; + case ETHTOOL_ID_OFF: + ret = hclge_set_led_status_phy(phydev, HCLGE_LED_FORCE_OFF); + break; + case ETHTOOL_ID_INACTIVE: + ret = hclge_set_led_status_phy(phydev, hdev->phy_led_val); +
[PATCH net-next 4/5] net: hns3: add ethtool -p support for fiber port
From: Jian Shen Add led location support for fiber port. The led will keep blinking when locating. Signed-off-by: Jian Shen Signed-off-by: Peng Li --- .../net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h | 20 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c| 101 - 2 files changed, 100 insertions(+), 21 deletions(-) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h index 1cd28e0..122f862 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h @@ -227,6 +227,9 @@ enum hclge_opcode_type { /* Mailbox cmd */ HCLGEVF_OPC_MBX_PF_TO_VF= 0x2000, + + /* Led command */ + HCLGE_OPC_LED_STATUS_CFG= 0xB000, }; #define HCLGE_TQP_REG_OFFSET 0x8 @@ -807,6 +810,23 @@ struct hclge_reset_cmd { #define HCLGE_NIC_CMQ_DESC_NUM 1024 #define HCLGE_NIC_CMQ_DESC_NUM_S 3 +#define HCLGE_LED_PORT_SPEED_STATE_S 0 +#define HCLGE_LED_PORT_SPEED_STATE_M GENMASK(5, 0) +#define HCLGE_LED_ACTIVITY_STATE_S 0 +#define HCLGE_LED_ACTIVITY_STATE_M GENMASK(1, 0) +#define HCLGE_LED_LINK_STATE_S 0 +#define HCLGE_LED_LINK_STATE_M GENMASK(1, 0) +#define HCLGE_LED_LOCATE_STATE_S 0 +#define HCLGE_LED_LOCATE_STATE_M GENMASK(1, 0) + +struct hclge_set_led_state_cmd { + u8 port_speed_led_config; + u8 link_led_config; + u8 activity_led_config; + u8 locate_led_config; + u8 rsv[20]; +}; + int hclge_cmd_init(struct hclge_dev *hdev); static inline void hclge_write_reg(void __iomem *base, u32 reg, u32 value) { diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c index 73caf06..039c90b 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c @@ -5819,6 +5819,34 @@ static void hclge_get_regs(struct hnae3_handle *handle, u32 *version, "Get 64 bit register failed, ret = %d.\n", ret); } +static int hclge_set_led_status_sfp(struct hclge_dev 
*hdev, u8 speed_led_status, + u8 act_led_status, u8 link_led_status, + u8 locate_led_status) +{ + struct hclge_set_led_state_cmd *req; + struct hclge_desc desc; + int ret; + + hclge_cmd_setup_basic_desc(&desc, HCLGE_OPC_LED_STATUS_CFG, false); + + req = (struct hclge_set_led_state_cmd *)desc.data; + hnae_set_field(req->port_speed_led_config, HCLGE_LED_PORT_SPEED_STATE_M, + HCLGE_LED_PORT_SPEED_STATE_S, speed_led_status); + hnae_set_field(req->link_led_config, HCLGE_LED_ACTIVITY_STATE_M, + HCLGE_LED_ACTIVITY_STATE_S, act_led_status); + hnae_set_field(req->activity_led_config, HCLGE_LED_LINK_STATE_M, + HCLGE_LED_LINK_STATE_S, link_led_status); + hnae_set_field(req->locate_led_config, HCLGE_LED_LOCATE_STATE_M, + HCLGE_LED_LOCATE_STATE_S, locate_led_status); + + ret = hclge_cmd_send(&hdev->hw, &desc, 1); + if (ret) + dev_err(&hdev->pdev->dev, + "Send set led state cmd error, ret =%d\n", ret); + + return ret; +} + static int hclge_set_led_status_phy(struct phy_device *phydev, int value) { int ret, cur_page; @@ -5871,6 +5899,12 @@ static int hclge_get_led_status_phy(struct phy_device *phydev, int *value) return ret; } +enum hclge_led_status { + HCLGE_LED_OFF, + HCLGE_LED_ON, + HCLGE_LED_NO_CHANGE = 0xFF, +}; + static int hclge_set_led_id(struct hnae3_handle *handle, enum ethtool_phys_id_state status) { @@ -5880,27 +5914,52 @@ static int hclge_set_led_id(struct hnae3_handle *handle, struct phy_device *phydev = hdev->hw.mac.phydev; int ret = 0; - if (!phydev) - return -EOPNOTSUPP; - - switch (status) { - case ETHTOOL_ID_ACTIVE: - ret = hclge_get_led_status_phy(phydev, &hdev->phy_led_val); - if (ret) - return ret; - return BLINK_FREQUENCY; - case ETHTOOL_ID_ON: - ret = hclge_set_led_status_phy(phydev, HCLGE_LED_FORCE_ON); - break; - case ETHTOOL_ID_OFF: - ret = hclge_set_led_status_phy(phydev, HCLGE_LED_FORCE_OFF); - break; - case ETHTOOL_ID_INACTIVE: - ret = hclge_set_led_status_phy(phydev, hdev->phy_led_val); - break; - default: - ret = -EINVAL; - break; + if (phydev) { + 
switch (status) { + case ETHTOOL_ID_ACTIVE: + ret = hclge_get_led_status_phy(phydev, +
[PATCH net-next 2/5] net: hns3: add manager table initialization for hardware
From: Fuyun Liang The manager table is empty by default. If it is not initialized, the management pkgs like LLDP will be dropped by hardware. Default entries need to be added to manager table. Signed-off-by: Fuyun Liang Signed-off-by: Peng Li --- .../net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h | 22 + .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c| 101 + 2 files changed, 123 insertions(+) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h index 2561e7a..1cd28e0 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h @@ -605,6 +605,28 @@ struct hclge_mac_vlan_mask_entry_cmd { u8 rsv2[14]; }; +#define HCLGE_MAC_MGR_MASK_VLAN_B BIT(0) +#define HCLGE_MAC_MGR_MASK_MAC_B BIT(1) +#define HCLGE_MAC_MGR_MASK_ETHERTYPE_B BIT(2) +#define HCLGE_MAC_ETHERTYPE_LLDP 0x88cc + +struct hclge_mac_mgr_tbl_entry_cmd { + u8 flags; + u8 resp_code; + __le16 vlan_tag; + __le32 mac_addr_hi32; + __le16 mac_addr_lo16; + __le16 rsv1; + __le16 ethter_type; + __le16 egress_port; + __le16 egress_queue; + u8 sw_port_id_aware; + u8 rsv2; + u8 i_port_bitmap; + u8 i_port_direction; + u8 rsv3[2]; +}; + #define HCLGE_CFG_MTA_MAC_SEL_S0x0 #define HCLGE_CFG_MTA_MAC_SEL_MGENMASK(1, 0) #define HCLGE_CFG_MTA_MAC_EN_B 0x7 diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c index c3d2cca..6e64bed 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c @@ -392,6 +392,16 @@ static const struct hclge_comm_stats_str g_mac_stats_string[] = { HCLGE_MAC_STATS_FIELD_OFF(mac_rx_send_app_bad_pkt_num)} }; +static const struct hclge_mac_mgr_tbl_entry_cmd hclge_mgr_table[] = { + { + .flags = HCLGE_MAC_MGR_MASK_VLAN_B, + .ethter_type = cpu_to_le16(HCLGE_MAC_ETHERTYPE_LLDP), + .mac_addr_hi32 = cpu_to_le32(htonl(0x0180C200)), + 
.mac_addr_lo16 = cpu_to_le16(htons(0x000E)), + .i_port_bitmap = 0x1, + }, +}; + static int hclge_64_bit_update_stats(struct hclge_dev *hdev) { #define HCLGE_64_BIT_CMD_NUM 5 @@ -4249,6 +4259,91 @@ int hclge_rm_mc_addr_common(struct hclge_vport *vport, return status; } +static int hclge_get_mac_ethertype_cmd_status(struct hclge_dev *hdev, + u16 cmdq_resp, u8 resp_code) +{ +#define HCLGE_ETHERTYPE_SUCCESS_ADD0 +#define HCLGE_ETHERTYPE_ALREADY_ADD1 +#define HCLGE_ETHERTYPE_MGR_TBL_OVERFLOW 2 +#define HCLGE_ETHERTYPE_KEY_CONFLICT 3 + + int return_status; + + if (cmdq_resp) { + dev_err(&hdev->pdev->dev, + "cmdq execute failed for get_mac_ethertype_cmd_status, status=%d.\n", + cmdq_resp); + return -EIO; + } + + switch (resp_code) { + case HCLGE_ETHERTYPE_SUCCESS_ADD: + case HCLGE_ETHERTYPE_ALREADY_ADD: + return_status = 0; + break; + case HCLGE_ETHERTYPE_MGR_TBL_OVERFLOW: + dev_err(&hdev->pdev->dev, + "add mac ethertype failed for manager table overflow.\n"); + return_status = -EIO; + break; + case HCLGE_ETHERTYPE_KEY_CONFLICT: + dev_err(&hdev->pdev->dev, + "add mac ethertype failed for key conflict.\n"); + return_status = -EIO; + break; + default: + dev_err(&hdev->pdev->dev, + "add mac ethertype failed for undefined, code=%d.\n", + resp_code); + return_status = -EIO; + } + + return return_status; +} + +static int hclge_add_mgr_tbl(struct hclge_dev *hdev, +const struct hclge_mac_mgr_tbl_entry_cmd *req) +{ + struct hclge_desc desc; + u8 resp_code; + u16 retval; + int ret; + + hclge_cmd_setup_basic_desc(&desc, HCLGE_OPC_MAC_ETHTYPE_ADD, false); + memcpy(desc.data, req, sizeof(struct hclge_mac_mgr_tbl_entry_cmd)); + + ret = hclge_cmd_send(&hdev->hw, &desc, 1); + if (ret) { + dev_err(&hdev->pdev->dev, + "add mac ethertype failed for cmd_send, ret =%d.\n", + ret); + return ret; + } + + resp_code = (le32_to_cpu(desc.data[0]) >> 8) & 0xff; + retval = le16_to_cpu(desc.retval); + + return hclge_get_mac_ethertype_cmd_status(hdev, retval, resp_code); +} + +static int init_mgr
Re: [PATCH net] gso: validate gso_type if SKB_GSO_DODGY
On Thu, Jan 18, 2018 at 12:20 AM, Willem de Bruijn wrote: > On Wed, Jan 17, 2018 at 10:48 PM, Jason Wang wrote: >> >> >> On 2018年01月18日 07:11, Willem de Bruijn wrote: >>> >>> From: Willem de Bruijn >>> >>> Validate gso_type of untrusted SKB_GSO_DODGY packets during >>> segmentation. >>> >>> Untrusted user packets are limited to a small set of gso types in >>> virtio_net_hdr_to_skb. But segmentation occurs on packet contents. >>> Syzkaller was able to enter gso callbacks that are not hardened >>> against untrusted user input. >>> >>> Fixes: f43798c27684 ("tun: Allow GSO using virtio_net_hdr") >> >> >> This commit is suspicious, I guess it should be 5c7cdf339af5 ("gso: Remove >> arbitrary checks for unsupported GSO") > > The specific SCTP path was introduced with commit 90017accff61 ("sctp: > add GSO support"). But the main issue that packets can be delivered to > gso handlers different from their gso_type goes back further. > > The commit you reference is actually older than the sctp gso patch, so > it makes sense that it did not have a check in the sctp_gso_segment. > > I still think that we should check in inet_gso_segment when we have > the proto, instead of in each {tcp, sctp, udp, esp, ...} handler having > a check of the form. > > !(type & (SKB_GSO_TCPV4 | > SKB_GSO_TCPV6 Unless we can create packets that legitimately combine SKB_GSO_DODGY with tunnel headers. virtio_net_hdr_to_skb does not accept tunneled gso types. But a tun device can be bridged with a gre tunnel in the host, creating a path that will call gre_gso_segment. If that is possible, then this patch is indeed too strict and we do need checks in the individual handlers.
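The centralized check being argued for here compares the declared gso_type of an untrusted packet against the transport protocol actually parsed from its headers, so a spoofed type can never reach the wrong segmentation callback. The core predicate can be modeled in isolation; a hedged userspace sketch with simplified stand-in constants and a hypothetical function name (the in-kernel version must also accept the TCP_ECN variant, as the patch does):

```c
#include <assert.h>
#include <stdbool.h>

/* simplified stand-ins for the kernel's SKB_GSO_* and IPPROTO_* values */
#define GSO_TCPV4   0x1u
#define GSO_UDP     0x2u
#define GSO_TCP_ECN 0x4u
#define GSO_DODGY   0x8u

#define PROTO_TCP 6
#define PROTO_UDP 17

/* Accept an untrusted (DODGY) gso_type only when it matches the
 * transport protocol found in the packet itself; trusted packets
 * (no DODGY bit) are left alone. */
static bool gso_type_matches_proto(unsigned int gso_type, int ipproto)
{
    if (!(gso_type & GSO_DODGY))
        return true;                    /* from the trusted stack */

    switch (gso_type & ~(GSO_DODGY | GSO_TCP_ECN)) {
    case GSO_TCPV4:
        return ipproto == PROTO_TCP;
    case GSO_UDP:
        return ipproto == PROTO_UDP;
    default:
        return false;                   /* e.g. a spoofed sctp/esp type */
    }
}
```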
[PATCH] ixgbe: fix ipv6 support for ipsec offload
Fix up the Rx path to watch for and decode ipv6 headers that might be carrying ipsec headers. To do so, we first change the search function to be able to take both ipv4 and ipv6 addresses from a pointer, and add an argument that tells which we are using. Then in the Rx handler we add a check for ipv4 vs ipv6 and then parse the headers accordingly. We can assume simple headers because this device won't decode packets with vlan or with ipv4/ipv6 extensions. We also change a flag used in the ...add_sa() function as it seems the XFRM stack doesn't actually ever set the XFRM_OFFLOAD_IPV6 flag bit. Signed-off-by: Shannon Nelson --- drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c | 39 ++ 1 file changed, 27 insertions(+), 12 deletions(-) diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c index 587fd8f..93eacdd 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c @@ -379,12 +379,13 @@ static int ixgbe_ipsec_find_empty_idx(struct ixgbe_ipsec *ipsec, bool rxtable) * @daddr: inbound address to match * @proto: protocol to match * @spi: SPI to match + * @ip4: true if using an ipv4 address * * Returns a pointer to the matching SA state information **/ static struct xfrm_state *ixgbe_ipsec_find_rx_state(struct ixgbe_ipsec *ipsec, - __be32 daddr, u8 proto, - __be32 spi) + __be32 *daddr, u8 proto, + __be32 spi, bool ip4) { struct rx_sa *rsa; struct xfrm_state *ret = NULL; @@ -392,7 +393,9 @@ static struct xfrm_state *ixgbe_ipsec_find_rx_state(struct ixgbe_ipsec *ipsec, rcu_read_lock(); hash_for_each_possible_rcu(ipsec->rx_sa_list, rsa, hlist, spi) if (spi == rsa->xs->id.spi && - daddr == rsa->xs->id.daddr.a4 && + ((ip4 && *daddr == rsa->xs->id.daddr.a4) || + (!ip4 && !memcmp(daddr, &rsa->xs->id.daddr.a6, + sizeof(rsa->xs->id.daddr.a6 && proto == rsa->xs->id.proto) { ret = rsa->xs; xfrm_state_hold(ret); @@ -505,7 +508,7 @@ static int ixgbe_ipsec_add_sa(struct 
xfrm_state *xs) } /* get ip for rx sa table */ - if (xs->xso.flags & XFRM_OFFLOAD_IPV6) + if (xs->props.family == AF_INET6) memcpy(rsa.ipaddr, &xs->id.daddr.a6, 16); else memcpy(&rsa.ipaddr[3], &xs->id.daddr.a4, 4); @@ -570,7 +573,7 @@ static int ixgbe_ipsec_add_sa(struct xfrm_state *xs) rsa.mode |= IXGBE_RXMOD_PROTO_ESP; if (rsa.decrypt) rsa.mode |= IXGBE_RXMOD_DECRYPT; - if (rsa.xs->xso.flags & XFRM_OFFLOAD_IPV6) + if (rsa.xs->props.family == AF_INET6) rsa.mode |= IXGBE_RXMOD_IPV6; /* the preparations worked, so save the info */ @@ -812,18 +815,30 @@ void ixgbe_ipsec_rx(struct ixgbe_ring *rx_ring, struct ixgbe_ipsec *ipsec = adapter->ipsec; struct xfrm_offload *xo = NULL; struct xfrm_state *xs = NULL; - struct iphdr *iph; - u8 *c_hdr; + struct ipv6hdr *ip6 = NULL; + struct iphdr *ip4 = NULL; + void *daddr; __be32 spi; + u8 *c_hdr; u8 proto; - /* we can assume no vlan header in the way, b/c the + /* Find the ip and crypto headers in the data. +* We can assume no vlan header in the way, b/c the * hw won't recognize the IPsec packet and anyway the * currently vlan device doesn't support xfrm offload. */ - /* TODO: not supporting IPv6 yet */ - iph = (struct iphdr *)(skb->data + ETH_HLEN); - c_hdr = (u8 *)iph + iph->ihl * 4; + if (pkt_info & cpu_to_le16(IXGBE_RXDADV_PKTTYPE_IPV4)) { + ip4 = (struct iphdr *)(skb->data + ETH_HLEN); + daddr = &ip4->daddr; + c_hdr = (u8 *)ip4 + ip4->ihl * 4; + } else if (pkt_info & cpu_to_le16(IXGBE_RXDADV_PKTTYPE_IPV6)) { + ip6 = (struct ipv6hdr *)(skb->data + ETH_HLEN); + daddr = &ip6->daddr; + c_hdr = (u8 *)ip6 + sizeof(struct ipv6hdr); + } else { + return; + } + switch (pkt_info & ipsec_pkt_types) { case cpu_to_le16(IXGBE_RXDADV_PKTTYPE_IPSEC_AH): spi = ((struct ip_auth_hdr *)c_hdr)->spi; @@ -837,7 +852,7 @@ void ixgbe_ipsec_rx(struct ixgbe_ring *rx_ring, return; } - xs = ixgbe_ipsec_find_rx_state(ipsec, iph->daddr, proto, spi); + xs = ixgbe_ipsec_find_rx_state(ipsec, daddr, proto, spi, !
Re: [PATCH net] gso: validate gso_type if SKB_GSO_DODGY
On Wed, Jan 17, 2018 at 10:48 PM, Jason Wang wrote: > > > On 2018年01月18日 07:11, Willem de Bruijn wrote: >> >> From: Willem de Bruijn >> >> Validate gso_type of untrusted SKB_GSO_DODGY packets during >> segmentation. >> >> Untrusted user packets are limited to a small set of gso types in >> virtio_net_hdr_to_skb. But segmentation occurs on packet contents. >> Syzkaller was able to enter gso callbacks that are not hardened >> against untrusted user input. >> >> Fixes: f43798c27684 ("tun: Allow GSO using virtio_net_hdr") > > > This commit is suspicious, I guess it should be 5c7cdf339af5 ("gso: Remove > arbitrary checks for unsupported GSO") The specific SCTP path was introduced with commit 90017accff61 ("sctp: add GSO support"). But the main issue that packets can be delivered to gso handlers different from their gso_type goes back further. The commit you reference is actually older than the sctp gso patch, so it makes sense that it did not have a check in the sctp_gso_segment. I still think that we should check in inet_gso_segment when we have the proto, instead of in each {tcp, sctp, udp, esp, ...} handler having a check of the form. !(type & (SKB_GSO_TCPV4 | SKB_GSO_TCPV6 Note that the above commit also only had these checks in the TSO branch if (skb_gso_ok(skb, features | NETIF_F_GSO_ROBUST)) { but the case where this condition fails has to be protected just as well, so a revert of that patch + new tests in all handlers added since is not sufficient. 
>> Link:http://lkml.kernel.org/r/<001a1137452496ffc305617e5...@google.com> >> Reported-by:syzbot+fee64147a25aecd48...@syzkaller.appspotmail.com >> Signed-off-by: Willem de Bruijn >> --- >> net/ipv4/af_inet.c | 21 + >> net/ipv6/ip6_offload.c | 23 ++- >> 2 files changed, 43 insertions(+), 1 deletion(-) >> >> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c >> index f00499a46927..d5a36827f7b1 100644 >> --- a/net/ipv4/af_inet.c >> +++ b/net/ipv4/af_inet.c >> @@ -1220,6 +1220,25 @@ int inet_sk_rebuild_header(struct sock *sk) >> } >> EXPORT_SYMBOL(inet_sk_rebuild_header); >> +static bool inet_gso_validate_dodgy(struct sk_buff *skb, int ipproto) >> +{ >> + unsigned int gso_type = skb_shinfo(skb)->gso_type; >> + >> + if (gso_type & SKB_GSO_DODGY) { >> + switch (gso_type & ~SKB_GSO_DODGY) { >> + case SKB_GSO_TCPV4: >> + case SKB_GSO_TCPV4 | SKB_GSO_TCP_ECN: >> + return ipproto == IPPROTO_TCP; >> + case SKB_GSO_UDP: >> + return ipproto == IPPROTO_UDP; >> + default: >> + return false; >> + } >> + } >> + >> + return true; >> +} > > > This seems more strict than what was removed by 5c7cdf339af5. Any reason for > this? I'm asking since this probably work for virito-net but I'm not sure it > works for other sources. All sources are limited to the same VIRTIO_NET_HDR_GSO_.. types that virtio_net_hdr_to_skb supports, as far as I know.
Re: [PATCH net] net: validate untrusted gso packets
virtio_net_hdr_to_skb checks gso_type, but it does not verify that this type correctly describes the actual packet. Segmentation happens based on packet contents. So a packet was crafted to enter sctp gso, even though no such gso_type exists. This issue is not specific to sctp. >>> >>> >>> So it looks to me we should do it in here in sctp_gso_segment(). >>> >>> if (skb_gso_ok(skb, features | NETIF_F_GSO_ROBUST)) { >>> /* Packet is from an untrusted source, reset gso_segs. */ >> >> No dodgy source can legitimately generate sctp code, so it should not >> even get there. > > > I think that's why we need extra check for dodgy packets here Commit 5c7cdf339af5 ("gso: Remove arbitrary checks for unsupported GSO") only compares gso_type against a list of allowed gso types. It did not compare gso_type against the actual packet contents, so would not have protected against this vulnerability. >, we used to > have gso_type verification here but was removed. In the future, virtio will > support passing sctp offload packet down for sure, so we need consider this > case. Then the sctp handler will first need to be fixed to not crash on bad packets. And every other handler for which no virtio gso type exists. >> Also, a packet can just as easily spoof an esp packet. >> See also the discussion in the Link above. > > > Then we probably need check there. Adding a check in every gso handler that does not expect packets with SKB_GSO_DODGY source seems backwards to me. A packet with gso_type SKB_GSO_TCPV4 should never be delivered to an sctp handler, say. We must check before passing to a handler whether the packet matches the callback. >> >> We can address this specific issue in segmentation by placing a check >> analogous to skb_validate_dodgy_gso in the network layer. 
Something >> like this (but with ECN option support): >> >> @@ -1258,6 +1258,22 @@ struct sk_buff *inet_gso_segment(struct sk_buff >> *skb, >> >> skb_reset_transport_header(skb); >> >> + gso_type = skb_shinfo(skb)->gso_type; >> + if (gso_type & SKB_GSO_DODGY) { >> + switch (gso_type & (SKB_GSO_TCPV4 | SKB_GSO_UDP)) { >> + case SKB_GSO_TCPV4: >> + if (proto != IPPROTO_TCP) >> + goto out; >> + break; >> + case SKB_GSO_UDP: >> + if (proto != IPPROTO_UDP) >> + goto out; >> + break; >> + default: >> + goto out; >> + } >> + } > > > This implements subset of function for codes which was removed by the commit > I mentioned below. No, as I explain above, it performs a different check. > >>> And we probably need to recover what has been removed since >>> 5c7cdf339af560f980b12eb6b0b5aa5f68ac6658 ("gso: Remove arbitrary checks >>> for >>> unsupported GSO"). >>> >> User packets can also have corrupted headers, tripping up segmentation >> logic that expects sane packets from the trusted protocol stack. >> Hardening all segmentation paths against all bad packets is error >> prone and slows down the common path, so validate on kernel entry. > > > I think evil packets should be rare in common case, so I'm not sure > validate > it on kernel entry is a good choice especially consider we've already > had > header check. This just makes that check more strict. Frequency of malicious packets is not really relevant if a single bad packet can cause damage. >>> >>> >>> We try hard to avoid flow dissector since its overhead is obvious. But >>> looks >>> like this patch did it unconditionally, and even for non gso packet. >>> The alternative to validate on kernel entry is to harden the entire segmentation layer and lower part of the stack. That is much harder to get right and not necessarily cheaper. >>> >>> >>> For performance reason. I think we should delay the check or segmentation >>> as >>> much as possible until it was really needed. 
>> >> Going through segmentation is probably as expensive as flow dissector, >> if not more so because of the indirect branches. > > > I think we don't even need to care about this consider the evil packet > should be rare. How does frequency matter when a single packet can crash a host? > And what you propose here is just a very small subset of the > necessary checking, more comes at gso header checking. So even if we care > performance, it only help for some specific case. It also fixed the bug that Eric sent a separate patch for, as that did not dissect as a valid TCP packet, either. >> And now we have two >> pieces of code that need to be hardened against malicious packets. > > > To me fixing the exist path is a good balance between security and > performance. > >> >> But I did overlook the guest to guest communication, with virtual devices >> that can
Re: [RFC][PATCH] get rid of the use of set_fs() (by way of kernel_recvmsg()) in sunrpc
On Wed, Jan 17, 2018 at 07:16:02PM -0800, Linus Torvalds wrote: > On Wed, Jan 17, 2018 at 7:06 PM, Al Viro wrote: > > > > Similar to the way put_cmsg() handles 32bit case on biarch > > targets, introduce a flag telling put_cmsg() to treat > > ->msg_control as kernel pointer, using memcpy instead of > > copy_to_user(). That allows to avoid the use of kernel_recvmsg() > > with its set_fs(). > > If this is the only case that kernel_recvmsg() exists for, then by all > means, that patch certainly looks like a good thing. In -next that's the only remaining caller. kernel_recvmsg() is { mm_segment_t oldfs = get_fs(); int result; iov_iter_kvec(&msg->msg_iter, READ | ITER_KVEC, vec, num, size); set_fs(KERNEL_DS); result = sock_recvmsg(sock, msg, flags); set_fs(oldfs); return result; } and iov_iter_kvec(&msg->msg_iter, READ | ITER_KVEC, vec, num, size); result = sock_recvmsg(sock, msg, flags); works just fine for copying the data - that gets handled by copy_to_iter() and copy_page_to_iter(). Those don't care about set_fs(); the trouble with sunrpc call site is that we want to fill msg_control-pointed kernel object. That gets copied by put_cmsg(). We could turn ->msg_control/->msg_controllen into another iov_iter, but seeing that we never do scatter-gather for those IMO that would be a massive overkill. A flag controlling whether ->msg_control is kernel or userland pointer would do, especially since we already have a flag for "do we want a native or compat layout for cmsg" in there. That's the only caller we need it for, but that thing looks cheap enough. Obviously needs to pass testing, including "is it too ugly to live as far as Davem is concerned" test, though...
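The flag Al describes effectively tags the msg_control destination: the control-message layout stays the same, but the copy becomes a plain memcpy for a kernel buffer instead of copy_to_user, removing the need for the set_fs(KERNEL_DS) dance. A toy userspace sketch of that dispatch (hypothetical names; copy_to_user is simulated with memcpy here, since in a normal process both "spaces" are the same):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Simulated copy_to_user: in the kernel this can fault and return
 * the number of bytes left uncopied; here it always succeeds. */
static int fake_copy_to_user(void *dst, const void *src, size_t len)
{
    memcpy(dst, src, len);
    return 0;
}

/* One copy routine, two destination kinds -- the idea behind
 * flagging ->msg_control as a kernel pointer in put_cmsg(). */
static int put_control_bytes(void *msg_control, bool control_is_kernel,
                             const void *data, size_t len)
{
    if (control_is_kernel) {
        memcpy(msg_control, data, len);     /* no set_fs() needed */
        return 0;
    }
    return fake_copy_to_user(msg_control, data, len);
}
```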
Re: [PATCH v2 net 0/2] IB/ipoib: ip link support
On Mon, Jan 15, 2018 at 01:13:09PM -0500, David Miller wrote: > From: Denis Drozdov > Date: Tue, 9 Jan 2018 23:42:45 +0200 > > > IP link was broken due to the changes in IPoIB for the rdma_netdev > > support after commit cd565b4b51e5 > > ("IB/IPoIB: Support acceleration options callbacks"). > > > > This patchset restores IPoIB pkey creation and removal using rtnetlink. > > The first patch introduces changes in the rtnetlink code in order to allow > > IPOIB allocate and free the netdevice. > > > > The second patch establishes appropriate rtnetlink callbacks for IPoIB > > device and restores IPoIB netlink support > > > > Changes since v1: > > - Fixed double free_netdev calls in ops->free_link in netdev_run_todo > > - Removed priv_size from ipoib_link_ops as not required anymore. > > Please fix your control flow so that the existing netlink op can do > the right thing. Okay. Since this bug has been outstanding in RDMA for so long, I've looked pretty carefully at this.. This patch does go too far, but ipoib does appear to legitimately need something special here and no amount of 'control flow fixing' in ipoib will solve it. The fundamental issue is that ipoib no longer has a generic software netdev for the child interface. Instead the child interface is bound directly to one of several *full* hardware drivers. ie the child and the master are basically exactly the same HW dependent code. Since the child is a full hardware netdev, it has its own HW driven needs for priv_size, txqs, rxqs, etc. There is no 'one size fits all' value like for all the other software based newlink users. It depends on the HW device installed in the system (eg mlx4, mlx5, hfi1) The only problem with rtnl newlink is that it wants to use priv_size, txqs, and rxqs values from a singleton static global structure (eg ipoib_link_ops) which cannot know in advance which ipoib HW driver is in use by the specific parent the child is being created for. This is what needs to be fixed. 
ipoib simply needs to have parent-dependent inputs to the alloc_netdev_mqs in rtnl_create_link. The patch proposes struct net_device *(*alloc_link)(struct net *src_net, const char *dev_name, struct nlattr *tb[]); To let ipoib allocate the netdev, and then obviously figure out the right parameters internally. This patch also adds some free-related stuff, but that seems to be going too far. ipoib can use the normal free paths. The above is all that is necessary. An alternative would be something like: struct rtnl_alloc_params { size_t priv_size; int txqs; int rxqs; }; void (*get_alloc_parameters)(struct net_device *net, struct rtnl_alloc_params *parms); To let ipoib tell rtnl_create_link what parameters to use based on 'net' instead of taking them from the global singleton. Do you see another choice? Can you accept one of these options? Thanks, Jason (BTW, Doug & I are now co-maintaining RDMA, so I say the above with my maintainer hat on. This is a serious userspace visible regression bug on our side, and I really want to see it fixed.)
Re: [PATCH net] flow_dissector: properly cap thoff field
On 2018年01月18日 06:21, Eric Dumazet wrote: From: Eric Dumazet syzbot reported yet another crash [1] that is caused by insufficient validation of DODGY packets. Two bugs are happening here to trigger the crash. 1) Flow dissection leaves with incorrect thoff field. 2) skb_probe_transport_header() sets transport header to this invalid thoff, even if pointing after skb valid data. 3) qdisc_pkt_len_init() reads out-of-bound data because it trusts tcp_hdrlen(skb) Possible fixes : - Full flow dissector validation before injecting bad DODGY packets in the stack. This approach was attempted here : https://patchwork.ozlabs.org/patch/ 861874/ - Have more robust functions in the core. This might be needed anyway for stable versions. This patch fixes the flow dissection issue. [1] CPU: 1 PID: 3144 Comm: syzkaller271204 Not tainted 4.15.0-rc4-mm1+ #49 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:17 [inline] dump_stack+0x194/0x257 lib/dump_stack.c:53 print_address_description+0x73/0x250 mm/kasan/report.c:256 kasan_report_error mm/kasan/report.c:355 [inline] kasan_report+0x23b/0x360 mm/kasan/report.c:413 __asan_report_load2_noabort+0x14/0x20 mm/kasan/report.c:432 __tcp_hdrlen include/linux/tcp.h:35 [inline] tcp_hdrlen include/linux/tcp.h:40 [inline] qdisc_pkt_len_init net/core/dev.c:3160 [inline] __dev_queue_xmit+0x20d3/0x2200 net/core/dev.c:3465 dev_queue_xmit+0x17/0x20 net/core/dev.c:3554 packet_snd net/packet/af_packet.c:2943 [inline] packet_sendmsg+0x3ad5/0x60a0 net/packet/af_packet.c:2968 sock_sendmsg_nosec net/socket.c:628 [inline] sock_sendmsg+0xca/0x110 net/socket.c:638 sock_write_iter+0x31a/0x5d0 net/socket.c:907 call_write_iter include/linux/fs.h:1776 [inline] new_sync_write fs/read_write.c:469 [inline] __vfs_write+0x684/0x970 fs/read_write.c:482 vfs_write+0x189/0x510 fs/read_write.c:544 SYSC_write fs/read_write.c:589 [inline] SyS_write+0xef/0x220 fs/read_write.c:581 
entry_SYSCALL_64_fastpath+0x1f/0x96 Fixes: 34fad54c2537 ("net: __skb_flow_dissect() must cap its return value") Fixes: a6e544b0a88b ("flow_dissector: Jump to exit code in __skb_flow_dissect") Signed-off-by: Eric Dumazet Cc: Willem de Bruijn Reported-by: syzbot --- net/core/flow_dissector.c |3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c index 15ce300637650e17fcab7e378b20fe7972686d46..544bddf08e13c7f6e47aadc737244c9ba5af56b2 100644 --- a/net/core/flow_dissector.c +++ b/net/core/flow_dissector.c @@ -976,8 +976,8 @@ bool __skb_flow_dissect(const struct sk_buff *skb, out_good: ret = true; - key_control->thoff = (u16)nhoff; out: + key_control->thoff = min_t(u16, nhoff, skb ? skb->len : hlen); key_basic->n_proto = proto; key_basic->ip_proto = ip_proto; @@ -985,7 +985,6 @@ bool __skb_flow_dissect(const struct sk_buff *skb, out_bad: ret = false; - key_control->thoff = min_t(u16, nhoff, skb ? skb->len : hlen); goto out; } EXPORT_SYMBOL(__skb_flow_dissect); Acked-by: Jason Wang I would ask for a CVE for this. Thanks
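The one-line fix above moves the clamp so thoff is bounded on every exit path, success included, instead of only on the error path. The clamp itself is just a min against the amount of packet data actually present; a trivial sketch of that invariant (hypothetical helper name, standing in for the min_t() expression in the patch):

```c
#include <assert.h>
#include <stdint.h>

/* Cap a transport-header offset so later users such as tcp_hdrlen()
 * can never be pointed past the packet's valid data. */
static uint16_t cap_thoff(uint16_t nhoff, uint16_t valid_len)
{
    return nhoff < valid_len ? nhoff : valid_len;
}
```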
Re: [PATCH 2/3] tcp: Add ESP encapsulation support
On Tue, Jan 16, 2018 at 11:28:23AM +0100, Steffen Klassert wrote: > > > Is there any chance this can be done with almost no change in TCP > > stack, using a layer model ? ( net/kcm comes to mind ) > > Herbert, would this be an option or is this not possible? Yes it can be done. I'm working on it. Cheers, -- Email: Herbert Xu Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PATCH net] gso: validate gso_type if SKB_GSO_DODGY
On 2018年01月18日 07:11, Willem de Bruijn wrote: From: Willem de Bruijn Validate gso_type of untrusted SKB_GSO_DODGY packets during segmentation. Untrusted user packets are limited to a small set of gso types in virtio_net_hdr_to_skb. But segmentation occurs on packet contents. Syzkaller was able to enter gso callbacks that are not hardened against untrusted user input. Fixes: f43798c27684 ("tun: Allow GSO using virtio_net_hdr") This commit is suspicious, I guess it should be 5c7cdf339af5 ("gso: Remove arbitrary checks for unsupported GSO") Link:http://lkml.kernel.org/r/<001a1137452496ffc305617e5...@google.com> Reported-by:syzbot+fee64147a25aecd48...@syzkaller.appspotmail.com Signed-off-by: Willem de Bruijn --- net/ipv4/af_inet.c | 21 + net/ipv6/ip6_offload.c | 23 ++- 2 files changed, 43 insertions(+), 1 deletion(-) diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c index f00499a46927..d5a36827f7b1 100644 --- a/net/ipv4/af_inet.c +++ b/net/ipv4/af_inet.c @@ -1220,6 +1220,25 @@ int inet_sk_rebuild_header(struct sock *sk) } EXPORT_SYMBOL(inet_sk_rebuild_header); +static bool inet_gso_validate_dodgy(struct sk_buff *skb, int ipproto) +{ + unsigned int gso_type = skb_shinfo(skb)->gso_type; + + if (gso_type & SKB_GSO_DODGY) { + switch (gso_type & ~SKB_GSO_DODGY) { + case SKB_GSO_TCPV4: + case SKB_GSO_TCPV4 | SKB_GSO_TCP_ECN: + return ipproto == IPPROTO_TCP; + case SKB_GSO_UDP: + return ipproto == IPPROTO_UDP; + default: + return false; + } + } + + return true; +} This seems more strict than what was removed by 5c7cdf339af5. Any reason for this? I'm asking since this probably work for virito-net but I'm not sure it works for other sources. Thanks
Re: [PATCH v2] r8152: disable RX aggregation on Dell TB16 dock
> On 18 Jan 2018, at 11:04 AM, Hayes Wang wrote: > > [...] >>> r8153 on Dell TB15/16 dock corrupts rx packets. >>> >>> This change is suggested by Realtek. They guess that the XHCI >>> controller doesn't have enough buffer, and their guesswork is correct, >>> once the RX aggregation gets disabled, the issue is gone. >>> >>> ASMedia is currently working on a real solution for this issue. >>> >>> Dell and ODM confirm the bcdDevice and iSerialNumber is unique for TB16. >>> >>> Note that TB15 has different bcdDevice and iSerialNumber, which are >>> not unique values. If you still have TB15, please contact Dell to >>> replace it with TB16. > > Excuse me. I don't understand why this patch is for a specific USB nic rather > than xHCI. > It seems to make the specific USB nic work while the other ones keep > failing. This patch is transient; once ASMedia finds the correct solution, the patch can be reverted. This patch only targets the built-in 8153 because externally plugged r8152 or asix devices do not suffer from this issue. Kai-Heng
Re: [PATCH net] net: validate untrusted gso packets
On 2018年01月17日 22:27, Willem de Bruijn wrote: On Wed, Jan 17, 2018 at 6:54 AM, Jason Wang wrote: On 2018年01月17日 12:33, Willem de Bruijn wrote: On Tue, Jan 16, 2018 at 11:04 PM, Jason Wang wrote: On 2018年01月17日 04:29, Willem de Bruijn wrote: From: Willem de Bruijn Validate gso packet type and headers on kernel entry. Reuse the info gathered by skb_probe_transport_header. Syzbot found two bugs by passing bad gso packets in packet sockets. Untrusted user packets are limited to a small set of gso types in virtio_net_hdr_to_skb. But segmentation occurs on packet contents. Syzkaller was able to enter gso callbacks that are not hardened against untrusted user input. Do this mean there's something missed in exist header check for dodgy packets? virtio_net_hdr_to_skb checks gso_type, but it does not verify that this type correctly describes the actual packet. Segmentation happens based on packet contents. So a packet was crafted to enter sctp gso, even though no such gso_type exists. This issue is not specific to sctp. So it looks to me we should do it in here in sctp_gso_segment(). if (skb_gso_ok(skb, features | NETIF_F_GSO_ROBUST)) { /* Packet is from an untrusted source, reset gso_segs. */ No dodgy source can legitimately generate sctp code, so it should not even get there. I think that's why we need extra check for dodgy packets here, we used to have gso_type verification here but was removed. In the future, virtio will support passing sctp offload packet down for sure, so we need consider this case. Also, a packet can just as easily spoof an esp packet. See also the discussion in the Link above. Then we probably need check there. We can address this specific issue in segmentation by placing a check analogous to skb_validate_dodgy_gso in the network layer. 
Something like this (but with ECN option support): @@ -1258,6 +1258,22 @@ struct sk_buff *inet_gso_segment(struct sk_buff *skb, skb_reset_transport_header(skb); + gso_type = skb_shinfo(skb)->gso_type; + if (gso_type & SKB_GSO_DODGY) { + switch (gso_type & (SKB_GSO_TCPV4 | SKB_GSO_UDP)) { + case SKB_GSO_TCPV4: + if (proto != IPPROTO_TCP) + goto out; + break; + case SKB_GSO_UDP: + if (proto != IPPROTO_UDP) + goto out; + break; + default: + goto out; + } + } This implements subset of function for codes which was removed by the commit I mentioned below. And we probably need to recover what has been removed since 5c7cdf339af560f980b12eb6b0b5aa5f68ac6658 ("gso: Remove arbitrary checks for unsupported GSO"). User packets can also have corrupted headers, tripping up segmentation logic that expects sane packets from the trusted protocol stack. Hardening all segmentation paths against all bad packets is error prone and slows down the common path, so validate on kernel entry. I think evil packets should be rare in common case, so I'm not sure validate it on kernel entry is a good choice especially consider we've already had header check. This just makes that check more strict. Frequency of malicious packets is not really relevant if a single bad packet can cause damage. We try hard to avoid flow dissector since its overhead is obvious. But looks like this patch did it unconditionally, and even for non gso packet. The alternative to validate on kernel entry is to harden the entire segmentation layer and lower part of the stack. That is much harder to get right and not necessarily cheaper. For performance reason. I think we should delay the check or segmentation as much as possible until it was really needed. Going through segmentation is probably as expensive as flow dissector, if not more so because of the indirect branches. I think we don't even need to care about this consider the evil packet should be rare. 
And what you propose here is just a very small subset of the necessary checking, more comes at gso header checking. So even if we care performance, it only help for some specific case. And now we have two pieces of code that need to be hardened against malicious packets. To me fixing the exist path is a good balance between security and performance. But I did overlook the guest to guest communication, with virtual devices that can set NETIF_F_GSO_ROBUST and so do not enter the stack. That means that one guest can send misrepresented packets to another. Is that preferable to absorbing the cost of validation? The problem is that guests and hosts don't trust each other. Even if one packet has already been validated by one side, it must be validated again by another side when needed. Doing validation twice is meaningless in this case. The current patch does not actually deprecate the path through the segmentation la
Re: [RFC][PATCH] get rid of the use of set_fs() (by way of kernel_recvmsg()) in sunrpc
On Wed, Jan 17, 2018 at 7:06 PM, Al Viro wrote: > > Similar to the way put_cmsg() handles 32bit case on biarch > targets, introduce a flag telling put_cmsg() to treat > ->msg_control as kernel pointer, using memcpy instead of > copy_to_user(). That allows to avoid the use of kernel_recvmsg() > with its set_fs(). If this is the only case that kernel_recvmsg() exists for, then by all means, that patch certainly looks like a good thing. Linus
[PATCH bpf-next 1/8] bpf: arraymap: move checks out of alloc function
Use the new callback to perform allocation checks for array maps. The fd maps don't need a special allocation callback, they only need a special check callback. Signed-off-by: Jakub Kicinski Reviewed-by: Quentin Monnet --- kernel/bpf/arraymap.c | 42 -- 1 file changed, 28 insertions(+), 14 deletions(-) diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c index ab94d304a634..68336092bfb4 100644 --- a/kernel/bpf/arraymap.c +++ b/kernel/bpf/arraymap.c @@ -49,27 +49,35 @@ static int bpf_array_alloc_percpu(struct bpf_array *array) } /* Called from syscall */ -static struct bpf_map *array_map_alloc(union bpf_attr *attr) +static int array_map_alloc_check(union bpf_attr *attr) { bool percpu = attr->map_type == BPF_MAP_TYPE_PERCPU_ARRAY; int numa_node = bpf_map_attr_numa_node(attr); - u32 elem_size, index_mask, max_entries; - bool unpriv = !capable(CAP_SYS_ADMIN); - struct bpf_array *array; - u64 array_size, mask64; /* check sanity of attributes */ if (attr->max_entries == 0 || attr->key_size != 4 || attr->value_size == 0 || attr->map_flags & ~ARRAY_CREATE_FLAG_MASK || (percpu && numa_node != NUMA_NO_NODE)) - return ERR_PTR(-EINVAL); + return -EINVAL; if (attr->value_size > KMALLOC_MAX_SIZE) /* if value_size is bigger, the user space won't be able to * access the elements. 
*/ - return ERR_PTR(-E2BIG); + return -E2BIG; + + return 0; +} + +static struct bpf_map *array_map_alloc(union bpf_attr *attr) +{ + bool percpu = attr->map_type == BPF_MAP_TYPE_PERCPU_ARRAY; + int numa_node = bpf_map_attr_numa_node(attr); + u32 elem_size, index_mask, max_entries; + bool unpriv = !capable(CAP_SYS_ADMIN); + struct bpf_array *array; + u64 array_size, mask64; elem_size = round_up(attr->value_size, 8); @@ -327,6 +335,7 @@ static void array_map_free(struct bpf_map *map) } const struct bpf_map_ops array_map_ops = { + .map_alloc_check = array_map_alloc_check, .map_alloc = array_map_alloc, .map_free = array_map_free, .map_get_next_key = array_map_get_next_key, @@ -337,6 +346,7 @@ const struct bpf_map_ops array_map_ops = { }; const struct bpf_map_ops percpu_array_map_ops = { + .map_alloc_check = array_map_alloc_check, .map_alloc = array_map_alloc, .map_free = array_map_free, .map_get_next_key = array_map_get_next_key, @@ -345,12 +355,12 @@ const struct bpf_map_ops percpu_array_map_ops = { .map_delete_elem = array_map_delete_elem, }; -static struct bpf_map *fd_array_map_alloc(union bpf_attr *attr) +static int fd_array_map_alloc_check(union bpf_attr *attr) { /* only file descriptors can be stored in this type of map */ if (attr->value_size != sizeof(u32)) - return ERR_PTR(-EINVAL); - return array_map_alloc(attr); + return -EINVAL; + return array_map_alloc_check(attr); } static void fd_array_map_free(struct bpf_map *map) @@ -474,7 +484,8 @@ void bpf_fd_array_map_clear(struct bpf_map *map) } const struct bpf_map_ops prog_array_map_ops = { - .map_alloc = fd_array_map_alloc, + .map_alloc_check = fd_array_map_alloc_check, + .map_alloc = array_map_alloc, .map_free = fd_array_map_free, .map_get_next_key = array_map_get_next_key, .map_lookup_elem = fd_array_map_lookup_elem, @@ -561,7 +572,8 @@ static void perf_event_fd_array_release(struct bpf_map *map, } const struct bpf_map_ops perf_event_array_map_ops = { - .map_alloc = fd_array_map_alloc, + .map_alloc_check = 
fd_array_map_alloc_check, + .map_alloc = array_map_alloc, .map_free = fd_array_map_free, .map_get_next_key = array_map_get_next_key, .map_lookup_elem = fd_array_map_lookup_elem, @@ -592,7 +604,8 @@ static void cgroup_fd_array_free(struct bpf_map *map) } const struct bpf_map_ops cgroup_array_map_ops = { - .map_alloc = fd_array_map_alloc, + .map_alloc_check = fd_array_map_alloc_check, + .map_alloc = array_map_alloc, .map_free = cgroup_fd_array_free, .map_get_next_key = array_map_get_next_key, .map_lookup_elem = fd_array_map_lookup_elem, @@ -610,7 +623,7 @@ static struct bpf_map *array_of_map_alloc(union bpf_attr *attr) if (IS_ERR(inner_map_meta)) return inner_map_meta; - map = fd_array_map_alloc(attr); + map = array_map_alloc(attr); if (IS_ERR(map)) { bpf_map_meta_free(inner_map_meta); return map; @@ -673,6 +686,7 @@ static u32 array_of_map_gen_lookup(struct bpf_map *map, } const struct bpf_map_ops array_of_maps_map_ops = { + .map_alloc_check = fd_array_map_alloc_check, .map_alloc = array_of_map_alloc,
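The split above means attribute validation can run for offloaded maps without ever touching the host allocator. A minimal userspace model of the pattern (Python stand-in; the names and error codes mirror the patch, but nothing here is kernel API):

```python
# Userspace model of the check/alloc split from this patch.
# Error codes mirror -EINVAL / -E2BIG; nothing here is kernel API.
EINVAL, E2BIG = 22, 7
KMALLOC_MAX_SIZE = 4 * 1024 * 1024  # illustrative bound

def array_map_alloc_check(attr):
    # Same sanity rules as the patch: arrays are indexed by a 4-byte key.
    if (attr["max_entries"] == 0 or attr["key_size"] != 4 or
            attr["value_size"] == 0):
        return -EINVAL
    if attr["value_size"] > KMALLOC_MAX_SIZE:
        return -E2BIG
    return 0

def map_create(attr, offloaded=False):
    # The check always runs; host allocation happens only when the map
    # is not device-bound.
    err = array_map_alloc_check(attr)
    if err:
        return err
    if offloaded:
        return "device-map"  # the device does its own allocation
    return [0] * attr["max_entries"]  # host-side backing store
```

Running the check alone for offloaded maps is what lets the fd map types reuse `fd_array_map_alloc_check()` as `.map_alloc_check` while keeping plain `array_map_alloc` as their `.map_alloc`.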
[PATCH bpf-next 8/8] nfp: bpf: add short busy wait for FW replies
Scheduling out and back in for every FW message can slow us down unnecessarily. Our experiments show that even under heavy load the FW responds to 99.9% of messages within 200 us. Add a short busy wait before entering the wait queue. Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/netronome/nfp/bpf/cmsg.c | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/netronome/nfp/bpf/cmsg.c b/drivers/net/ethernet/netronome/nfp/bpf/cmsg.c index 71e6586acc36..80d3aa0fc9d3 100644 --- a/drivers/net/ethernet/netronome/nfp/bpf/cmsg.c +++ b/drivers/net/ethernet/netronome/nfp/bpf/cmsg.c @@ -157,7 +157,14 @@ nfp_bpf_cmsg_wait_reply(struct nfp_app_bpf *bpf, enum nfp_bpf_cmsg_type type, int tag) { struct sk_buff *skb; - int err; + int i, err; + + for (i = 0; i < 50; i++) { + udelay(4); + skb = nfp_bpf_reply(bpf, tag); + if (skb) + return skb; + } err = wait_event_interruptible_timeout(bpf->cmsg_wq, skb = nfp_bpf_reply(bpf, tag), -- 2.15.1
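The poll-then-sleep strategy generalizes beyond this driver: spin briefly for the common fast reply, then fall back to a blocking wait. A rough userspace sketch (the helper names are made up; `time.sleep` stands in for both `udelay()` and the wait queue):

```python
import time

def wait_reply(poll_fn, spins=50, spin_delay=4e-6, timeout=1.0):
    # Fast path: most replies arrive within ~200us, so a short busy
    # wait avoids the cost of scheduling out and back in.
    for _ in range(spins):
        time.sleep(spin_delay)  # stands in for udelay(4)
        reply = poll_fn()
        if reply is not None:
            return reply
    # Slow path: block (modelled here as coarse polling against a
    # deadline) until the reply shows up or we time out.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        reply = poll_fn()
        if reply is not None:
            return reply
        time.sleep(0.001)
    return None
```

The spin count and delay are tunables: the 50 x 4 us in the patch bounds the busy phase to roughly the observed 200 us common-case latency before giving the CPU back.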
[PATCH bpf-next 7/8] selftest/bpf: extend the offload test with map checks
Check map device information is reported correctly, and perform basic map operations. Check device destruction gets rid of the maps and map allocation failure path by telling netdevsim to reject map offload via DebugFS. Signed-off-by: Jakub Kicinski --- tools/testing/selftests/bpf/Makefile | 3 +- tools/testing/selftests/bpf/sample_map_ret0.c | 34 + tools/testing/selftests/bpf/test_offload.py | 206 +++--- 3 files changed, 218 insertions(+), 25 deletions(-) create mode 100644 tools/testing/selftests/bpf/sample_map_ret0.c diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile index a8aa7e251c8e..3a44b655d852 100644 --- a/tools/testing/selftests/bpf/Makefile +++ b/tools/testing/selftests/bpf/Makefile @@ -19,7 +19,8 @@ TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map test TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o test_obj_id.o \ test_pkt_md_access.o test_xdp_redirect.o test_xdp_meta.o sockmap_parse_prog.o \ sockmap_verdict_prog.o dev_cgroup.o sample_ret0.o test_tracepoint.o \ - test_l4lb_noinline.o test_xdp_noinline.o test_stacktrace_map.o + test_l4lb_noinline.o test_xdp_noinline.o test_stacktrace_map.o \ + sample_map_ret0.o TEST_PROGS := test_kmod.sh test_xdp_redirect.sh test_xdp_meta.sh \ test_offload.py diff --git a/tools/testing/selftests/bpf/sample_map_ret0.c b/tools/testing/selftests/bpf/sample_map_ret0.c new file mode 100644 index ..0756303676ac --- /dev/null +++ b/tools/testing/selftests/bpf/sample_map_ret0.c @@ -0,0 +1,34 @@ +/* SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause) */ +#include +#include "bpf_helpers.h" + +struct bpf_map_def SEC("maps") htab = { + .type = BPF_MAP_TYPE_HASH, + .key_size = sizeof(__u32), + .value_size = sizeof(long), + .max_entries = 2, +}; + +struct bpf_map_def SEC("maps") array = { + .type = BPF_MAP_TYPE_ARRAY, + .key_size = sizeof(__u32), + .value_size = sizeof(long), + .max_entries = 2, +}; + +/* Sample program which should always load 
for testing control paths. */ +SEC(".text") int func() +{ + __u64 key64 = 0; + __u32 key = 0; + long *value; + + value = bpf_map_lookup_elem(&htab, &key); + if (!value) + return 1; + value = bpf_map_lookup_elem(&array, &key64); + if (!value) + return 1; + + return 0; +} diff --git a/tools/testing/selftests/bpf/test_offload.py b/tools/testing/selftests/bpf/test_offload.py index e3c750f17cb8..833b9c1ec450 100755 --- a/tools/testing/selftests/bpf/test_offload.py +++ b/tools/testing/selftests/bpf/test_offload.py @@ -20,6 +20,7 @@ import os import pprint import random import string +import struct import subprocess import time @@ -156,6 +157,14 @@ netns = [] # net namespaces to be removed (len(progs), expected)) return progs +def bpftool_map_list(expected=None, ns=""): +_, maps = bpftool("map show", JSON=True, ns=ns, fail=True) +if expected is not None: +if len(maps) != expected: +fail(True, "%d BPF maps loaded, expected %d" % + (len(maps), expected)) +return maps + def bpftool_prog_list_wait(expected=0, n_retry=20): for i in range(n_retry): nprogs = len(bpftool_prog_list()) @@ -164,6 +173,14 @@ netns = [] # net namespaces to be removed time.sleep(0.05) raise Exception("Time out waiting for program counts to stabilize want %d, have %d" % (expected, nprogs)) +def bpftool_map_list_wait(expected=0, n_retry=20): +for i in range(n_retry): +nmaps = len(bpftool_map_list()) +if nmaps == expected: +return +time.sleep(0.05) +raise Exception("Time out waiting for map counts to stabilize want %d, have %d" % (expected, nmaps)) + def ip(args, force=False, JSON=True, ns="", fail=True): if force: args = "-force " + args @@ -193,6 +210,26 @@ netns = [] # net namespaces to be removed return name return None +def int2str(fmt, val): +ret = [] +for b in struct.pack(fmt, val): +ret.append(int(b)) +return " ".join(map(lambda x: str(x), ret)) + +def str2int(strtab): +inttab = [] +for i in strtab: +inttab.append(int(i, 16)) +ba = bytearray(inttab) +if len(strtab) == 4: +fmt = "I" +elif 
len(strtab) == 8: +fmt = "Q" +else: +raise Exception("String array of len %d can't be unpacked to an int" % +(len(strtab))) +return struct.unpack(fmt, ba)[0] + class DebugfsDir: """ Class for accessing DebugFS directories as a dictionary. @@ -311,13 +348,13 @@ netns = [] # net namespaces to be removed return ip("link set dev %s mtu %d" % (self.dev["ifname"], mtu), fail=fail) -def set_xdp(self, bpf, mode, force=False, fail=Tr
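The `int2str`/`str2int` helpers added to test_offload.py translate between Python integers and bpftool's per-byte representation of map keys and values. Copied out as a lightly restructured standalone sketch, their behaviour is easy to verify via a `struct` round trip:

```python
import struct

def int2str(fmt, val):
    # "I" (u32) or "Q" (u64) -> space-separated decimal bytes, the
    # form the selftest passes on bpftool "map update" command lines.
    return " ".join(str(b) for b in struct.pack(fmt, val))

def str2int(strtab):
    # bpftool dumps values as a list of hex byte strings; rebuild the
    # native-endian integer from 4 or 8 bytes.
    ba = bytearray(int(s, 16) for s in strtab)
    if len(ba) == 4:
        fmt = "I"
    elif len(ba) == 8:
        fmt = "Q"
    else:
        raise Exception("String array of len %d can't be unpacked to an int"
                        % len(ba))
    return struct.unpack(fmt, ba)[0]
```

Note the asymmetry: `int2str` emits decimal strings for the bpftool command line, while `str2int` parses the hex strings bpftool prints, so the two are not direct inverses of each other.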
[PATCH bpf-next 6/8] netdevsim: bpf: support fake map offload
Add to netdevsim ability to pretend it's offloading BPF maps. We only allow allocation of tiny 2 entry maps, to keep things simple. Mutex lock may seem heavy for the operations we perform, but we want to make sure callbacks can sleep. Signed-off-by: Jakub Kicinski --- drivers/net/netdevsim/bpf.c | 246 ++ drivers/net/netdevsim/netdevsim.h | 3 + 2 files changed, 249 insertions(+) diff --git a/drivers/net/netdevsim/bpf.c b/drivers/net/netdevsim/bpf.c index 5134d5c1306c..b3851bbefad3 100644 --- a/drivers/net/netdevsim/bpf.c +++ b/drivers/net/netdevsim/bpf.c @@ -17,6 +17,7 @@ #include #include #include +#include #include #include @@ -31,6 +32,19 @@ struct nsim_bpf_bound_prog { struct list_head l; }; +#define NSIM_BPF_MAX_KEYS 2 + +struct nsim_bpf_bound_map { + struct netdevsim *ns; + struct bpf_offloaded_map *map; + struct mutex mutex; + struct nsim_map_entry { + void *key; + void *value; + } entry[NSIM_BPF_MAX_KEYS]; + struct list_head l; +}; + static int nsim_debugfs_bpf_string_read(struct seq_file *file, void *data) { const char **str = file->private; @@ -284,6 +298,224 @@ nsim_setup_prog_hw_checks(struct netdevsim *ns, struct netdev_bpf *bpf) return 0; } +static bool +nsim_map_key_match(struct bpf_map *map, struct nsim_map_entry *e, void *key) +{ + return e->key && !memcmp(key, e->key, map->key_size); +} + +static int nsim_map_key_find(struct bpf_offloaded_map *offmap, void *key) +{ + struct nsim_bpf_bound_map *nmap = offmap->dev_priv; + unsigned int i; + + for (i = 0; i < ARRAY_SIZE(nmap->entry); i++) + if (nsim_map_key_match(&offmap->map, &nmap->entry[i], key)) + return i; + + return -ENOENT; +} + +static int +nsim_map_alloc_elem(struct bpf_offloaded_map *offmap, unsigned int idx) +{ + struct nsim_bpf_bound_map *nmap = offmap->dev_priv; + + nmap->entry[idx].key = kmalloc(offmap->map.key_size, GFP_USER); + if (!nmap->entry[idx].key) + return -ENOMEM; + nmap->entry[idx].value = kmalloc(offmap->map.value_size, GFP_USER); + if (!nmap->entry[idx].value) { + 
kfree(nmap->entry[idx].key); + nmap->entry[idx].key = NULL; + return -ENOMEM; + } + + return 0; +} + +static int +nsim_map_get_next_key(struct bpf_offloaded_map *offmap, + void *key, void *next_key) +{ + struct nsim_bpf_bound_map *nmap = offmap->dev_priv; + int idx = -ENOENT; + + mutex_lock(&nmap->mutex); + + if (key) + idx = nsim_map_key_find(offmap, key); + if (idx == -ENOENT) + idx = 0; + else + idx++; + + for (; idx < ARRAY_SIZE(nmap->entry); idx++) { + if (nmap->entry[idx].key) { + memcpy(next_key, nmap->entry[idx].key, + offmap->map.key_size); + break; + } + } + + mutex_unlock(&nmap->mutex); + + if (idx == ARRAY_SIZE(nmap->entry)) + return -ENOENT; + return 0; +} + +static int +nsim_map_lookup_elem(struct bpf_offloaded_map *offmap, void *key, void *value) +{ + struct nsim_bpf_bound_map *nmap = offmap->dev_priv; + int idx; + + mutex_lock(&nmap->mutex); + + idx = nsim_map_key_find(offmap, key); + if (idx >= 0) + memcpy(value, nmap->entry[idx].value, offmap->map.value_size); + + mutex_unlock(&nmap->mutex); + + return idx < 0 ? 
idx : 0; +} + +static int +nsim_map_update_elem(struct bpf_offloaded_map *offmap, +void *key, void *value, u64 flags) +{ + struct nsim_bpf_bound_map *nmap = offmap->dev_priv; + int idx, err = 0; + + mutex_lock(&nmap->mutex); + + idx = nsim_map_key_find(offmap, key); + if (idx < 0 && flags == BPF_EXIST) { + err = idx; + goto exit_unlock; + } + if (idx >= 0 && flags == BPF_NOEXIST) { + err = -EEXIST; + goto exit_unlock; + } + + if (idx < 0) { + for (idx = 0; idx < ARRAY_SIZE(nmap->entry); idx++) + if (!nmap->entry[idx].key) + break; + if (idx == ARRAY_SIZE(nmap->entry)) { + err = -E2BIG; + goto exit_unlock; + } + + err = nsim_map_alloc_elem(offmap, idx); + if (err) + goto exit_unlock; + } + + memcpy(nmap->entry[idx].key, key, offmap->map.key_size); + memcpy(nmap->entry[idx].value, value, offmap->map.value_size); +exit_unlock: + mutex_unlock(&nmap->mutex); + + return err; +} + +static int nsim_map_delete_elem(struct bpf_offloaded_map *offmap, void *key) +{ + stru
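netdevsim's fake offload boils down to a fixed-size key/value table with BPF's update-flag semantics: `BPF_NOEXIST` fails on a present key, `BPF_EXIST` fails on an absent one, and a full table returns -E2BIG. A userspace model of that logic, mirroring `nsim_map_key_find()` and `nsim_map_update_elem()` (class and method names are illustrative):

```python
ENOENT, EEXIST, E2BIG = 2, 17, 7
BPF_ANY, BPF_NOEXIST, BPF_EXIST = 0, 1, 2

class FakeOffloadMap:
    def __init__(self, max_keys=2):          # NSIM_BPF_MAX_KEYS == 2
        self.entries = [None] * max_keys     # each slot: (key, value) or None

    def _find(self, key):
        # Linear scan, like nsim_map_key_find().
        for i, e in enumerate(self.entries):
            if e is not None and e[0] == key:
                return i
        return -ENOENT

    def update(self, key, value, flags=BPF_ANY):
        idx = self._find(key)
        if idx < 0 and flags == BPF_EXIST:
            return idx                       # must already exist
        if idx >= 0 and flags == BPF_NOEXIST:
            return -EEXIST                   # must not exist yet
        if idx < 0:
            try:
                idx = self.entries.index(None)  # first free slot
            except ValueError:
                return -E2BIG                # table full
        self.entries[idx] = (key, value)
        return 0

    def lookup(self, key):
        idx = self._find(key)
        return self.entries[idx][1] if idx >= 0 else None
```

Keeping the capacity at two entries is enough for the selftest to exercise every path: successful insert, flag violations, and table-full.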
[PATCH bpf-next 5/8] tools: bpftool: report device information for offloaded maps
Print the information about device on which map is created. Signed-off-by: Jakub Kicinski --- tools/bpf/bpftool/map.c | 7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/tools/bpf/bpftool/map.c b/tools/bpf/bpftool/map.c index 8d7db9d6b9cd..a152c1a5c94c 100644 --- a/tools/bpf/bpftool/map.c +++ b/tools/bpf/bpftool/map.c @@ -428,6 +428,9 @@ static int show_map_close_json(int fd, struct bpf_map_info *info) jsonw_name(json_wtr, "flags"); jsonw_printf(json_wtr, "%#x", info->map_flags); + + print_dev_json(info->ifindex, info->netns_dev, info->netns_ino); + jsonw_uint_field(json_wtr, "bytes_key", info->key_size); jsonw_uint_field(json_wtr, "bytes_value", info->value_size); jsonw_uint_field(json_wtr, "max_entries", info->max_entries); @@ -469,7 +472,9 @@ static int show_map_close_plain(int fd, struct bpf_map_info *info) if (*info->name) printf("name %s ", info->name); - printf("flags 0x%x\n", info->map_flags); + printf("flags 0x%x", info->map_flags); + print_dev_plain(info->ifindex, info->netns_dev, info->netns_ino); + printf("\n"); printf("\tkey %uB value %uB max_entries %u", info->key_size, info->value_size, info->max_entries); -- 2.15.1
[PATCH bpf-next 3/8] bpf: offload: allow array map offload
The special handling of different map types is left to the driver. Allow offload of array maps by simply adding it to accepted types. For nfp we have to make sure array elements are not deleted. Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/netronome/nfp/bpf/offload.c | 2 ++ kernel/bpf/offload.c | 3 ++- 2 files changed, 4 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/netronome/nfp/bpf/offload.c b/drivers/net/ethernet/netronome/nfp/bpf/offload.c index c452bf9462e0..1a357aacc444 100644 --- a/drivers/net/ethernet/netronome/nfp/bpf/offload.c +++ b/drivers/net/ethernet/netronome/nfp/bpf/offload.c @@ -176,6 +176,8 @@ nfp_bpf_map_get_next_key(struct bpf_offloaded_map *offmap, static int nfp_bpf_map_delete_elem(struct bpf_offloaded_map *offmap, void *key) { + if (offmap->map.map_type == BPF_MAP_TYPE_ARRAY) + return -EINVAL; return nfp_bpf_ctrl_del_entry(offmap, key); } diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c index 6c0baa1cf8f8..2657976aec2a 100644 --- a/kernel/bpf/offload.c +++ b/kernel/bpf/offload.c @@ -299,7 +299,8 @@ struct bpf_map *bpf_map_offload_map_alloc(union bpf_attr *attr) if (!capable(CAP_SYS_ADMIN)) return ERR_PTR(-EPERM); - if (attr->map_type != BPF_MAP_TYPE_HASH) + if (attr->map_type != BPF_MAP_TYPE_ARRAY && + attr->map_type != BPF_MAP_TYPE_HASH) return ERR_PTR(-EINVAL); offmap = kzalloc(sizeof(*offmap), GFP_USER); -- 2.15.1
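Both halves of this patch are small allow/deny decisions: the core widens its offloadable-type allow-list, and the driver vetoes the one operation arrays don't support. Schematically (illustrative Python, not the kernel code):

```python
EINVAL, EPERM = 22, 1
# After this patch the core accepts both types for device binding.
OFFLOADABLE_TYPES = {"BPF_MAP_TYPE_HASH", "BPF_MAP_TYPE_ARRAY"}

def offload_map_alloc_allowed(map_type, cap_sys_admin=True):
    if not cap_sys_admin:
        return -EPERM
    if map_type not in OFFLOADABLE_TYPES:
        return -EINVAL
    return 0

def nfp_map_delete_elem(map_type):
    # Array elements can't be deleted -- reject before asking the FW.
    if map_type == "BPF_MAP_TYPE_ARRAY":
        return -EINVAL
    return 0  # would forward to nfp_bpf_ctrl_del_entry()
```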
[PATCH bpf-next 4/8] bpf: offload: report device information about offloaded maps
Tell user space about device on which the map was created. Unfortunate reality of user ABI makes sharing this code with program offload difficult but the information is the same. Signed-off-by: Jakub Kicinski --- include/linux/bpf.h| 2 ++ include/uapi/linux/bpf.h | 3 +++ kernel/bpf/offload.c | 55 ++ kernel/bpf/syscall.c | 6 + tools/include/uapi/linux/bpf.h | 3 +++ 5 files changed, 69 insertions(+) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 025b1c2f8053..66df387106de 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -586,6 +586,8 @@ void bpf_prog_offload_destroy(struct bpf_prog *prog); int bpf_prog_offload_info_fill(struct bpf_prog_info *info, struct bpf_prog *prog); +int bpf_map_offload_info_fill(struct bpf_map_info *info, struct bpf_map *map); + int bpf_map_offload_lookup_elem(struct bpf_map *map, void *key, void *value); int bpf_map_offload_update_elem(struct bpf_map *map, void *key, void *value, u64 flags); diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 7c2259e8bc54..af1f49ad8b88 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -938,6 +938,9 @@ struct bpf_map_info { __u32 max_entries; __u32 map_flags; char name[BPF_OBJ_NAME_LEN]; + __u32 ifindex; + __u64 netns_dev; + __u64 netns_ino; } __attribute__((aligned(8))); /* User bpf_sock_ops struct to access socket values and specify request ops diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c index 2657976aec2a..c9401075b58c 100644 --- a/kernel/bpf/offload.c +++ b/kernel/bpf/offload.c @@ -413,6 +413,61 @@ int bpf_map_offload_get_next_key(struct bpf_map *map, void *key, void *next_key) return ret; } +struct ns_get_path_bpf_map_args { + struct bpf_offloaded_map *offmap; + struct bpf_map_info *info; +}; + +static struct ns_common *bpf_map_offload_info_fill_ns(void *private_data) +{ + struct ns_get_path_bpf_map_args *args = private_data; + struct ns_common *ns; + struct net *net; + + rtnl_lock(); + down_read(&bpf_devs_lock); + + if 
(args->offmap->netdev) { + args->info->ifindex = args->offmap->netdev->ifindex; + net = dev_net(args->offmap->netdev); + get_net(net); + ns = &net->ns; + } else { + args->info->ifindex = 0; + ns = NULL; + } + + up_read(&bpf_devs_lock); + rtnl_unlock(); + + return ns; +} + +int bpf_map_offload_info_fill(struct bpf_map_info *info, struct bpf_map *map) +{ + struct ns_get_path_bpf_map_args args = { + .offmap = map_to_offmap(map), + .info = info, + }; + struct inode *ns_inode; + struct path ns_path; + void *res; + + res = ns_get_path_cb(&ns_path, bpf_map_offload_info_fill_ns, &args); + if (IS_ERR(res)) { + if (!info->ifindex) + return -ENODEV; + return PTR_ERR(res); + } + + ns_inode = ns_path.dentry->d_inode; + info->netns_dev = new_encode_dev(ns_inode->i_sb->s_dev); + info->netns_ino = ns_inode->i_ino; + path_put(&ns_path); + + return 0; +} + bool bpf_offload_dev_match(struct bpf_prog *prog, struct bpf_map *map) { struct bpf_offloaded_map *offmap; diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index c28524483bf4..abc842aee032 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -1799,6 +1799,12 @@ static int bpf_map_get_info_by_fd(struct bpf_map *map, info.map_flags = map->map_flags; memcpy(info.name, map->name, sizeof(map->name)); + if (bpf_map_is_dev_bound(map)) { + err = bpf_map_offload_info_fill(&info, map); + if (err) + return err; + } + if (copy_to_user(uinfo, &info, info_len) || put_user(info_len, &uattr->info.info_len)) return -EFAULT; diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 69f96af4a569..2d57c73ecacf 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -935,6 +935,9 @@ struct bpf_map_info { __u32 max_entries; __u32 map_flags; char name[BPF_OBJ_NAME_LEN]; + __u32 ifindex; + __u64 netns_dev; + __u64 netns_ino; } __attribute__((aligned(8))); /* User bpf_sock_ops struct to access socket values and specify request ops -- 2.15.1
[pull request bpf-next 0/8] bpf: offload: array maps, reporting and test
Hi! This set brings in the rest of the map offload code held up by urgent fixes and improvements to the BPF arrays. The first 3 patches take care of array map offload: similarly to hash maps, the attribute validation is split out to a separate map op and used for both the offloaded and non-offloaded cases (allocation only happens if the map is on the host). Offload support comes down to allowing this map type through the offload check in the core. The NFP driver also rejects the delete operation in case of array maps. Subsequent patches add reporting of the target device in a very similar way to how the target device of programs is reported (ifindex+netns dev/ino). Netdevsim is extended with a trivial map implementation allowing us to test the offload in test_offload.py. The last patch adds a small busy wait to NFP map IO; this improves the response times, which is especially useful for map dumps. The following changes since commit e2e3224122e64ebe15fe02a63e8fe09b64a8c743: samples/bpf: xdp2skb_meta comment explain why pkt-data pointers are invalidated (2018-01-18 01:49:09 +0100) are available in the Git repository at: g...@gitolite.kernel.org:pub/scm/linux/kernel/git/kuba/linux bpf-offload-array-maps-and-reporting for you to fetch changes up to cebe56e2d9efaa883e0ea98b396328ed72fe1c46: nfp: bpf: add short busy wait for FW replies (2018-01-17 19:00:12 -0800) Jakub Kicinski (8): bpf: arraymap: move checks out of alloc function bpf: arraymap: use bpf_map_init_from_attr() bpf: offload: allow array map offload bpf: offload: report device information about offloaded maps tools: bpftool: report device information for offloaded maps netdevsim: bpf: support fake map offload selftest/bpf: extend the offload test with map checks nfp: bpf: add short busy wait for FW replies drivers/net/ethernet/netronome/nfp/bpf/cmsg.c| 9 +- drivers/net/ethernet/netronome/nfp/bpf/offload.c | 2 + drivers/net/netdevsim/bpf.c | 246 +++ drivers/net/netdevsim/netdevsim.h| 3 + include/linux/bpf.h | 2 + include/uapi/linux/bpf.h | 3 +
kernel/bpf/arraymap.c| 49 +++-- kernel/bpf/offload.c | 58 +- kernel/bpf/syscall.c | 6 + tools/bpf/bpftool/map.c | 7 +- tools/include/uapi/linux/bpf.h | 3 + tools/testing/selftests/bpf/Makefile | 3 +- tools/testing/selftests/bpf/sample_map_ret0.c| 34 tools/testing/selftests/bpf/test_offload.py | 206 --- 14 files changed, 583 insertions(+), 48 deletions(-) create mode 100644 tools/testing/selftests/bpf/sample_map_ret0.c
[PATCH bpf-next 2/8] bpf: arraymap: use bpf_map_init_from_attr()
Arraymap was not converted to use bpf_map_init_from_attr() to avoid merge conflicts with emergency fixes. Do it now. Signed-off-by: Jakub Kicinski --- kernel/bpf/arraymap.c | 7 +-- 1 file changed, 1 insertion(+), 6 deletions(-) diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c index 68336092bfb4..b1f66480135b 100644 --- a/kernel/bpf/arraymap.c +++ b/kernel/bpf/arraymap.c @@ -120,12 +120,7 @@ static struct bpf_map *array_map_alloc(union bpf_attr *attr) array->map.unpriv_array = unpriv; /* copy mandatory map attributes */ - array->map.map_type = attr->map_type; - array->map.key_size = attr->key_size; - array->map.value_size = attr->value_size; - array->map.max_entries = attr->max_entries; - array->map.map_flags = attr->map_flags; - array->map.numa_node = numa_node; + bpf_map_init_from_attr(&array->map, attr); array->elem_size = elem_size; if (!percpu) -- 2.15.1
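For reference, the helper centralizes the per-map-type boilerplate of copying mandatory attributes into the map structure. A model of what `bpf_map_init_from_attr()` does (the field list is taken from the six removed assignments):

```python
# The mandatory attributes every map type used to copy by hand.
MANDATORY_ATTRS = ("map_type", "key_size", "value_size",
                   "max_entries", "map_flags", "numa_node")

class BpfMap:
    """Stand-in for struct bpf_map's attribute fields."""

def map_init_from_attr(bpf_map, attr):
    # One place to extend when a new mandatory attribute is added,
    # instead of touching every map implementation.
    for name in MANDATORY_ATTRS:
        setattr(bpf_map, name, attr[name])

attr = {"map_type": "BPF_MAP_TYPE_ARRAY", "key_size": 4, "value_size": 8,
        "max_entries": 2, "map_flags": 0, "numa_node": -1}
m = BpfMap()
map_init_from_attr(m, attr)
```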
[RFC][PATCH] get rid of the use of set_fs() (by way of kernel_recvmsg()) in sunrpc
On Wed, Jan 17, 2018 at 06:52:32PM +, Al Viro wrote: [use of set_fs() by sunrpc] > We are asking for recvmsg() with zero data length; what we really want is > ->msg_control. And _that_ is why we need that set_fs() - we want the damn > thing to go into local variable. > > But note that filling ->msg_control will happen in put_cmsg(), called > from ip_cmsg_recv_pktinfo(), called from ip_cmsg_recv_offset(), > called from udp_recvmsg(), called from sock_recvmsg_nosec(), called > from sock_recvmsg(). Or in another path in case of IPv6. > Sure, we can arrange for propagation of that all way down those > call chains. My preference would be to try and mark that (one and > only) case in ->msg_flags, so that put_cmsg() would be able to > check. ___sys_recvmsg() sets that as > msg_sys->msg_flags = flags & (MSG_CMSG_CLOEXEC|MSG_CMSG_COMPAT); > so we ought to be free to use any bit other than those two. Since > put_cmsg() already checks ->msg_flags, that shouldn't put too much > overhead. Folks, does anybody have objections against the following: Get rid of kernel_recvmsg() (and thus set_fs()) use in sunrpc In net/sunrpc/svcsock.c:svc_udp_recvfrom() we want to get IP_PKTINFO; currently we stash the address of a local variable into ->msg_control (which normally contains a userland pointer in recipients) and issue a zero-length ->recvmsg() under set_fs(KERNEL_DS). Similar to the way put_cmsg() handles the 32bit case on biarch targets, introduce a flag telling put_cmsg() to treat ->msg_control as a kernel pointer, using memcpy instead of copy_to_user(). That allows us to avoid the use of kernel_recvmsg() with its set_fs(). All places that might have non-NULL ->msg_control either pass the msghdr only to ->sendmsg() and its ilk, or have ->msg_flags sanitized before passing the msghdr anywhere. IOW, there is no way for the new flag to reach put_cmsg() in the mainline kernel, and after this change it will only be seen in the sunrpc case.
Incidentally, all other kernel_recvmsg() users are very easy to convert to sock_recvmsg(), so that would allow to kill kernel_recvmsg() off... Signed-off-by: Al Viro --- diff --git a/include/linux/socket.h b/include/linux/socket.h index 9286a5a8c60c..60947da9c806 100644 --- a/include/linux/socket.h +++ b/include/linux/socket.h @@ -298,6 +298,7 @@ struct ucred { #else #define MSG_CMSG_COMPAT0 /* We never have 32 bit fixups */ #endif +#define MSG_CMSG_KERNEL0x1000 /* Setsockoptions(2) level. Thanks to BSD these must match IPPROTO_xxx */ diff --git a/net/core/scm.c b/net/core/scm.c index b1ff8a441748..1b73b682e441 100644 --- a/net/core/scm.c +++ b/net/core/scm.c @@ -237,10 +237,17 @@ int put_cmsg(struct msghdr * msg, int level, int type, int len, void *data) cmhdr.cmsg_len = cmlen; err = -EFAULT; - if (copy_to_user(cm, &cmhdr, sizeof cmhdr)) - goto out; - if (copy_to_user(CMSG_DATA(cm), data, cmlen - sizeof(struct cmsghdr))) - goto out; + if (unlikely(MSG_CMSG_KERNEL & msg->msg_flags)) { + struct cmsghdr *p = msg->msg_control; + memcpy(p, &cmhdr, sizeof cmhdr); + memcpy(CMSG_DATA(p), data, cmlen - sizeof(struct cmsghdr)); + } else { + if (copy_to_user(cm, &cmhdr, sizeof cmhdr)) + goto out; + if (copy_to_user(CMSG_DATA(cm), data, +cmlen - sizeof(struct cmsghdr))) + goto out; + } cmlen = CMSG_SPACE(len); if (msg->msg_controllen < cmlen) cmlen = msg->msg_controllen; diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c index 5570719e4787..774904f67c93 100644 --- a/net/sunrpc/svcsock.c +++ b/net/sunrpc/svcsock.c @@ -545,7 +545,7 @@ static int svc_udp_recvfrom(struct svc_rqst *rqstp) .msg_name = svc_addr(rqstp), .msg_control = cmh, .msg_controllen = sizeof(buffer), - .msg_flags = MSG_DONTWAIT, + .msg_flags = MSG_DONTWAIT | MSG_CMSG_KERNEL, }; size_t len; int err; @@ -565,8 +565,7 @@ static int svc_udp_recvfrom(struct svc_rqst *rqstp) clear_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags); skb = NULL; - err = kernel_recvmsg(svsk->sk_sock, &msg, NULL, -0, 0, MSG_PEEK | 
MSG_DONTWAIT); + err = sock_recvmsg(svsk->sk_sock, &msg, MSG_PEEK | MSG_DONTWAIT); if (err >= 0) skb = skb_recv_udp(svsk->sk_sk, 0, 1, &err);
RE: [PATCH v2] r8152: disable RX aggregation on Dell TB16 dock
[...] > > r8153 on Dell TB15/16 dock corrupts rx packets. > > > > This change is suggested by Realtek. They guess that the XHCI > > controller doesn't have enough buffer, and their guesswork is correct, > > once the RX aggregation gets disabled, the issue is gone. > > > > ASMedia is currently working on a real solution for this issue. > > > > Dell and ODM confirm the bcdDevice and iSerialNumber is unique for TB16. > > > > Note that TB15 has different bcdDevice and iSerialNumber, which are > > not unique values. If you still have TB15, please contact Dell to > > replace it with TB16. Excuse me, I don't understand why this patch targets a specific USB NIC rather than the xHCI controller. It seems to make this specific USB NIC work while leaving the other ones broken. Best Regards, Hayes
[PATCH net-next 07/12] nfp: add TLV capabilities to the BAR
NFP is entirely programmable, including the PCI data interface. Using a fixed control BAR layout certainly makes implementations easier, but it requires careful consideration when space is allocated. Once a BAR area is allocated to one feature, nothing else can use it. Allocating space statically also requires it to be sized upfront, which leads to either unnecessary limitations or wastage. We currently have a 32bit capability word defined which tells drivers which application FW features are supported. Most of the bits are exhausted. The same bits are also reused for enabling specific features. The bulk of the capabilities don't need an enable bit, however, leading to confusion and wastage. TLVs seem like a better fit for expressing the capabilities of applications running on programmable hardware. This patch leaves the front of the BAR as is, and declares a TLV capability start at offset 0x58. Most of the space up to 0x0d90 is already allocated, but the used space can be wrapped with RESERVED TLVs.
E.g.: AddressType Length 0x0058RESERVED 0xe00 /* Wrap basic structures */ 0x0e5cFEATURE_A 0x004 0x0e64FEATURE_B 0x004 0x0e6cRESERVED 0x990 /* Wrap qeueue stats */ 0x1800FEATURE_C 0x100 Signed-off-by: Jakub Kicinski Reviewed-by: Dirk van der Merwe --- drivers/net/ethernet/netronome/nfp/Makefile| 1 + drivers/net/ethernet/netronome/nfp/nfp_net.h | 3 + .../net/ethernet/netronome/nfp/nfp_net_common.c| 5 + drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.c | 113 + drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.h | 67 +++- 5 files changed, 186 insertions(+), 3 deletions(-) create mode 100644 drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.c diff --git a/drivers/net/ethernet/netronome/nfp/Makefile b/drivers/net/ethernet/netronome/nfp/Makefile index 064f00e23a19..d5866d708dfa 100644 --- a/drivers/net/ethernet/netronome/nfp/Makefile +++ b/drivers/net/ethernet/netronome/nfp/Makefile @@ -22,6 +22,7 @@ nfp-objs := \ nfp_hwmon.o \ nfp_main.o \ nfp_net_common.o \ + nfp_net_ctrl.o \ nfp_net_debugdump.o \ nfp_net_ethtool.o \ nfp_net_main.o \ diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net.h b/drivers/net/ethernet/netronome/nfp/nfp_net.h index 6f6e3d6fd935..d88eda9707e6 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net.h +++ b/drivers/net/ethernet/netronome/nfp/nfp_net.h @@ -578,6 +578,7 @@ struct nfp_net_dp { * @qcp_cfg:Pointer to QCP queue used for configuration notification * @tx_bar: Pointer to mapped TX queues * @rx_bar: Pointer to mapped FL/RX queues + * @tlv_caps: Parsed TLV capabilities * @debugfs_dir: Device directory in debugfs * @vnic_list: Entry on device vNIC list * @pdev: Backpointer to PCI device @@ -644,6 +645,8 @@ struct nfp_net { u8 __iomem *tx_bar; u8 __iomem *rx_bar; + struct nfp_net_tlv_caps tlv_caps; + struct dentry *debugfs_dir; struct list_head vnic_list; diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c index 2b5cad3069a7..b47da7ff01dd 100644 --- 
a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c @@ -3815,6 +3815,11 @@ int nfp_net_init(struct nfp_net *nn) nn->dp.ctrl |= NFP_NET_CFG_CTRL_IRQMOD; } + err = nfp_net_tlv_caps_parse(&nn->pdev->dev, nn->dp.ctrl_bar, +&nn->tlv_caps); + if (err) + return err; + if (nn->dp.netdev) nfp_net_netdev_init(nn); diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.c b/drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.c new file mode 100644 index ..ff155242a665 --- /dev/null +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.c @@ -0,0 +1,113 @@ +/* + * Copyright (C) 2018 Netronome Systems, Inc. + * + * This software is dual licensed under the GNU General License Version 2, + * June 1991 as shown in the file COPYING in the top-level directory of this + * source tree or the BSD 2-Clause License provided below. You have the + * option to license this software under the complete terms of either license. + * + * The BSD 2-Clause License: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * 1. Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * 2. Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other
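A TLV capability walker for the scheme described above is straightforward: read type/length headers, skip RESERVED regions, stop at an end marker. This sketch uses made-up type codes and a made-up header layout standing in for the real control BAR definitions (which live in nfp_net_ctrl.h and are not reproduced here):

```python
import struct

# Illustrative type codes only -- not the driver's actual values.
TLV_RESERVED, TLV_END, TLV_FEATURE_A = 0, 1, 0x10

def parse_tlv_caps(bar, offset=0):
    """Walk type/length records laid over a byte buffer ('the BAR')."""
    caps = {}
    while offset + 4 <= len(bar):
        # Assumed header: 16-bit type then 16-bit value length in bytes.
        tlv_type, length = struct.unpack_from(">HH", bar, offset)
        offset += 4
        if tlv_type == TLV_END:
            break
        if tlv_type != TLV_RESERVED:      # RESERVED only reserves space
            caps[tlv_type] = bar[offset:offset + length]
        offset += length                  # skip the value either way
    return caps

# Build a BAR image: a RESERVED wrapper around legacy space, one
# feature TLV, then the end marker.
bar = (struct.pack(">HH", TLV_RESERVED, 8) + b"\x00" * 8 +
       struct.pack(">HH", TLV_FEATURE_A, 4) + struct.pack(">I", 0xdeadbeef) +
       struct.pack(">HH", TLV_END, 0))
```

This is what buys the flexibility the commit message describes: new features claim space by appending records, and the legacy fixed areas survive as opaque RESERVED spans.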
[PATCH net-next 06/12] nfp: improve app not found message
When the driver app matching the loaded FW is not found, users are faced with: nfp: failed to find app with ID 0x%02x This message does not properly explain that the matching driver code is either not built into the driver or the driver is too old. Signed-off-by: Jakub Kicinski Reviewed-by: Dirk van der Merwe --- drivers/net/ethernet/netronome/nfp/nfp_app.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_app.c b/drivers/net/ethernet/netronome/nfp/nfp_app.c index 8ce65d4276b1..6aedef0ad433 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_app.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_app.c @@ -124,7 +124,7 @@ struct nfp_app *nfp_app_alloc(struct nfp_pf *pf, enum nfp_app_id id) struct nfp_app *app; if (id >= ARRAY_SIZE(apps) || !apps[id]) { - nfp_err(pf->cpp, "failed to find app with ID 0x%02hhx\n", id); + nfp_err(pf->cpp, "unknown FW app ID 0x%02hhx, driver too old or support for FW not built in\n", id); return ERR_PTR(-EINVAL); } -- 2.15.1
[PATCH net-next 05/12] nfp: protect each repr pointer individually with RCU
Representors are grouped in sets by type. Currently the whole sets are under RCU protection, but individual representor pointers are not. This causes some inconveniences when representors have to be destroyed, because we have to allocate new sets to remove any representors. Protect the individual pointers with RCU. Signed-off-by: Jakub Kicinski Reviewed-by: Dirk van der Merwe --- drivers/net/ethernet/netronome/nfp/flower/main.c | 45 -- drivers/net/ethernet/netronome/nfp/nfp_net_repr.c | 71 +++ drivers/net/ethernet/netronome/nfp/nfp_net_repr.h | 15 +++-- 3 files changed, 68 insertions(+), 63 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/flower/main.c b/drivers/net/ethernet/netronome/nfp/flower/main.c index 67c406815365..3c05b637 100644 --- a/drivers/net/ethernet/netronome/nfp/flower/main.c +++ b/drivers/net/ethernet/netronome/nfp/flower/main.c @@ -99,7 +99,7 @@ nfp_flower_repr_get(struct nfp_app *app, u32 port_id) if (port >= reprs->num_reprs) return NULL; - return reprs->reprs[port]; + return rcu_dereference(reprs->reprs[port]); } static int @@ -114,15 +114,19 @@ nfp_flower_reprs_reify(struct nfp_app *app, enum nfp_repr_type type, if (!reprs) return 0; - for (i = 0; i < reprs->num_reprs; i++) - if (reprs->reprs[i]) { - struct nfp_repr *repr = netdev_priv(reprs->reprs[i]); + for (i = 0; i < reprs->num_reprs; i++) { + struct net_device *netdev; + + netdev = nfp_repr_get_locked(app, reprs, i); + if (netdev) { + struct nfp_repr *repr = netdev_priv(netdev); err = nfp_flower_cmsg_portreify(repr, exists); if (err) return err; count++; } + } return count; } @@ -234,19 +238,21 @@ nfp_flower_spawn_vnic_reprs(struct nfp_app *app, return -ENOMEM; for (i = 0; i < cnt; i++) { + struct net_device *repr; struct nfp_port *port; u32 port_id; - reprs->reprs[i] = nfp_repr_alloc(app); - if (!reprs->reprs[i]) { + repr = nfp_repr_alloc(app); + if (!repr) { err = -ENOMEM; goto err_reprs_clean; } + RCU_INIT_POINTER(reprs->reprs[i], repr); /* For now we only support 1 PF */ 
WARN_ON(repr_type == NFP_REPR_TYPE_PF && i); - port = nfp_port_alloc(app, port_type, reprs->reprs[i]); + port = nfp_port_alloc(app, port_type, repr); if (repr_type == NFP_REPR_TYPE_PF) { port->pf_id = i; port->vnic = priv->nn->dp.ctrl_bar; @@ -257,11 +263,11 @@ nfp_flower_spawn_vnic_reprs(struct nfp_app *app, app->pf->vf_cfg_mem + i * NFP_NET_CFG_BAR_SZ; } - eth_hw_addr_random(reprs->reprs[i]); + eth_hw_addr_random(repr); port_id = nfp_flower_cmsg_pcie_port(nfp_pcie, vnic_type, i, queue); - err = nfp_repr_init(app, reprs->reprs[i], + err = nfp_repr_init(app, repr, port_id, port, priv->nn->dp.netdev); if (err) { nfp_port_free(port); @@ -270,7 +276,7 @@ nfp_flower_spawn_vnic_reprs(struct nfp_app *app, nfp_info(app->cpp, "%s%d Representor(%s) created\n", repr_type == NFP_REPR_TYPE_PF ? "PF" : "VF", i, -reprs->reprs[i]->name); +repr->name); } nfp_app_reprs_set(app, repr_type, reprs); @@ -291,7 +297,7 @@ nfp_flower_spawn_vnic_reprs(struct nfp_app *app, err_reprs_remove: reprs = nfp_app_reprs_set(app, repr_type, NULL); err_reprs_clean: - nfp_reprs_clean_and_free(reprs); + nfp_reprs_clean_and_free(app, reprs); return err; } @@ -329,17 +335,18 @@ nfp_flower_spawn_phy_reprs(struct nfp_app *app, struct nfp_flower_priv *priv) for (i = 0; i < eth_tbl->count; i++) { unsigned int phys_port = eth_tbl->ports[i].index; + struct net_device *repr; struct nfp_port *port; u32 cmsg_port_id; - reprs->reprs[phys_port] = nfp_repr_alloc(app); - if (!reprs->reprs[phys_port]) { + repr = nfp_repr_alloc(app); + if (!repr) { err = -ENOMEM; goto err_reprs_clean; } + RCU_INIT_POINTER(reprs->reprs[phys_port], repr); - port = nfp_port_alloc(app, NFP_PORT_PHYS_PORT, -
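The per-slot RCU scheme from this patch can be sketched in plain C. This is a userspace illustration only, not the kernel code: the RCU macros are reduced to plain loads/stores (which is valid on a single thread with no concurrent updates), and everything except the `reprs`/`num_reprs` shape is invented for the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Userspace stand-ins for the kernel RCU macros; single-threaded they
 * reduce to plain pointer loads and stores. */
#define rcu_dereference(p)       (p)
#define RCU_INIT_POINTER(p, v)   ((p) = (v))
#define rcu_assign_pointer(p, v) ((p) = (v))

struct repr {
    char name[16];
};

struct reprs {
    size_t num_reprs;
    struct repr *reprs[];    /* each slot individually RCU-protected */
};

static struct reprs *reprs_alloc(size_t n)
{
    struct reprs *set = calloc(1, sizeof(*set) + n * sizeof(struct repr *));

    if (set)
        set->num_reprs = n;
    return set;
}

/* Reader side: bounds-check, then dereference one slot, mirroring the
 * shape of nfp_flower_repr_get() after the patch. */
static struct repr *reprs_get(struct reprs *set, size_t port)
{
    if (!set || port >= set->num_reprs)
        return NULL;
    return rcu_dereference(set->reprs[port]);
}

/* Writer side: one slot can now be cleared without reallocating the
 * whole set (the old scheme forced a fresh set per removal).  The
 * caller frees the old entry after a grace period. */
static struct repr *reprs_clear(struct reprs *set, size_t port)
{
    struct repr *old = set->reprs[port];

    rcu_assign_pointer(set->reprs[port], NULL);
    return old;
}
```

The design point the commit message makes is exactly this last function: with only the set pointer under RCU, removing one representor meant building a new set; with per-slot protection it is a single pointer swap.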
[PATCH net-next 04/12] nfp: add nfp_reprs_get_locked() helper
The write side of repr tables is always done under pf->lock. Add a helper to dereference repr table pointers under protection of that lock. Signed-off-by: Jakub Kicinski Reviewed-by: Dirk van der Merwe --- drivers/net/ethernet/netronome/nfp/nfp_app.c | 12 ++-- drivers/net/ethernet/netronome/nfp/nfp_app.h | 2 ++ drivers/net/ethernet/netronome/nfp/nfp_net_repr.c | 4 +--- 3 files changed, 13 insertions(+), 5 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_app.c b/drivers/net/ethernet/netronome/nfp/nfp_app.c index 955a9f44d244..8ce65d4276b1 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_app.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_app.c @@ -32,6 +32,8 @@ */ #include +#include +#include #include #include @@ -98,14 +100,20 @@ nfp_app_ctrl_msg_alloc(struct nfp_app *app, unsigned int size, gfp_t priority) return skb; } +struct nfp_reprs * +nfp_reprs_get_locked(struct nfp_app *app, enum nfp_repr_type type) +{ + return rcu_dereference_protected(app->reprs[type], +lockdep_is_held(&app->pf->lock)); +} + struct nfp_reprs * nfp_app_reprs_set(struct nfp_app *app, enum nfp_repr_type type, struct nfp_reprs *reprs) { struct nfp_reprs *old; - old = rcu_dereference_protected(app->reprs[type], - lockdep_is_held(&app->pf->lock)); + old = nfp_reprs_get_locked(app, type); rcu_assign_pointer(app->reprs[type], reprs); return old; diff --git a/drivers/net/ethernet/netronome/nfp/nfp_app.h b/drivers/net/ethernet/netronome/nfp/nfp_app.h index 6a6eb02b516e..91d469a8e3e6 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_app.h +++ b/drivers/net/ethernet/netronome/nfp/nfp_app.h @@ -384,6 +384,8 @@ static inline struct net_device *nfp_app_repr_get(struct nfp_app *app, u32 id) struct nfp_app *nfp_app_from_netdev(struct net_device *netdev); +struct nfp_reprs * +nfp_reprs_get_locked(struct nfp_app *app, enum nfp_repr_type type); struct nfp_reprs * nfp_app_reprs_set(struct nfp_app *app, enum nfp_repr_type type, struct nfp_reprs *reprs); diff --git 
a/drivers/net/ethernet/netronome/nfp/nfp_net_repr.c b/drivers/net/ethernet/netronome/nfp/nfp_net_repr.c index 317f87cc3cc6..e2452176fa23 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_repr.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_repr.c @@ -424,9 +424,7 @@ int nfp_reprs_resync_phys_ports(struct nfp_app *app) struct nfp_repr *repr; int i; - old_reprs = - rcu_dereference_protected(app->reprs[NFP_REPR_TYPE_PHYS_PORT], - lockdep_is_held(&app->pf->lock)); + old_reprs = nfp_reprs_get_locked(app, NFP_REPR_TYPE_PHYS_PORT); if (!old_reprs) return 0; -- 2.15.1
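The "dereference while holding the writer lock" pattern that `nfp_reprs_get_locked()` encapsulates can be shown with a toy userspace model. This is a sketch under stated assumptions: the mutex and `lockdep_is_held()` are reduced to a plain flag, and only the names `reprs_get_locked`/`reprs_set` mirror the patch.

```c
#include <assert.h>
#include <stddef.h>

/* All writers of the repr table hold pf->lock, so dereferencing the
 * table is legal whenever that lock is held.  Here the lock is a flag. */
struct app {
    int lock_held;           /* stand-in for pf->lock + lockdep_is_held() */
    void *reprs[4];          /* RCU-protected pointers in the kernel */
};

static void *reprs_get_locked(struct app *app, int type)
{
    /* rcu_dereference_protected(app->reprs[type],
     *                           lockdep_is_held(&app->pf->lock)) */
    assert(app->lock_held);
    return app->reprs[type];
}

/* The set helper reuses the locked getter, as nfp_app_reprs_set() does
 * after the patch, so the lockdep annotation lives in exactly one place. */
static void *reprs_set(struct app *app, int type, void *new_set)
{
    void *old = reprs_get_locked(app, type);

    app->reprs[type] = new_set;   /* rcu_assign_pointer() in the kernel */
    return old;
}
```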
[PATCH net-next 09/12] nfp: read mailbox address from TLV caps
Allow specifying alternative vNIC mailbox location in TLV caps. This way we can size the mailbox to the needs and not necessarily waste 512B of ctrl memory space. Signed-off-by: Jakub Kicinski Reviewed-by: Dirk van der Merwe --- .../net/ethernet/netronome/nfp/nfp_net_common.c| 20 +++-- drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.c | 11 ++ drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.h | 25 ++ 3 files changed, 46 insertions(+), 10 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c index 4218a8660d46..15c2fec4f520 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c @@ -293,9 +293,15 @@ int nfp_net_reconfig(struct nfp_net *nn, u32 update) */ static int nfp_net_reconfig_mbox(struct nfp_net *nn, u32 mbox_cmd) { + u32 mbox = nn->tlv_caps.mbox_off; int ret; - nn_writeq(nn, NFP_NET_CFG_MBOX_CMD, mbox_cmd); + if (!nfp_net_has_mbox(&nn->tlv_caps)) { + nn_err(nn, "no mailbox present, command: %u\n", mbox_cmd); + return -EIO; + } + + nn_writeq(nn, mbox + NFP_NET_CFG_MBOX_SIMPLE_CMD, mbox_cmd); ret = nfp_net_reconfig(nn, NFP_NET_CFG_UPDATE_MBOX); if (ret) { @@ -303,7 +309,7 @@ static int nfp_net_reconfig_mbox(struct nfp_net *nn, u32 mbox_cmd) return ret; } - return -nn_readl(nn, NFP_NET_CFG_MBOX_RET); + return -nn_readl(nn, mbox + NFP_NET_CFG_MBOX_SIMPLE_RET); } /* Interrupt configuration and handling @@ -3084,8 +3090,9 @@ nfp_net_vlan_rx_add_vid(struct net_device *netdev, __be16 proto, u16 vid) if (!vid) return 0; - nn_writew(nn, NFP_NET_CFG_VLAN_FILTER_VID, vid); - nn_writew(nn, NFP_NET_CFG_VLAN_FILTER_PROTO, ETH_P_8021Q); + nn_writew(nn, nn->tlv_caps.mbox_off + NFP_NET_CFG_VLAN_FILTER_VID, vid); + nn_writew(nn, nn->tlv_caps.mbox_off + NFP_NET_CFG_VLAN_FILTER_PROTO, + ETH_P_8021Q); return nfp_net_reconfig_mbox(nn, NFP_NET_CFG_MBOX_CMD_CTAG_FILTER_ADD); } @@ -3101,8 +3108,9 @@ nfp_net_vlan_rx_kill_vid(struct net_device 
*netdev, __be16 proto, u16 vid) if (!vid) return 0; - nn_writew(nn, NFP_NET_CFG_VLAN_FILTER_VID, vid); - nn_writew(nn, NFP_NET_CFG_VLAN_FILTER_PROTO, ETH_P_8021Q); + nn_writew(nn, nn->tlv_caps.mbox_off + NFP_NET_CFG_VLAN_FILTER_VID, vid); + nn_writew(nn, nn->tlv_caps.mbox_off + NFP_NET_CFG_VLAN_FILTER_PROTO, + ETH_P_8021Q); return nfp_net_reconfig_mbox(nn, NFP_NET_CFG_MBOX_CMD_CTAG_FILTER_KILL); } diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.c b/drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.c index 72da1b352418..ffb402746ad4 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.c @@ -43,6 +43,8 @@ static void nfp_net_tlv_caps_reset(struct nfp_net_tlv_caps *caps) { memset(caps, 0, sizeof(*caps)); caps->me_freq_mhz = 1200; + caps->mbox_off = NFP_NET_CFG_MBOX_BASE; + caps->mbox_len = NFP_NET_CFG_MBOX_VAL_MAX_SZ; } int nfp_net_tlv_caps_parse(struct device *dev, u8 __iomem *ctrl_mem, @@ -102,6 +104,15 @@ int nfp_net_tlv_caps_parse(struct device *dev, u8 __iomem *ctrl_mem, caps->me_freq_mhz = readl(data); break; + case NFP_NET_CFG_TLV_TYPE_MBOX: + if (!length) { + caps->mbox_off = 0; + caps->mbox_len = 0; + } else { + caps->mbox_off = data - ctrl_mem; + caps->mbox_len = length; + } + break; default: if (!FIELD_GET(NFP_NET_CFG_TLV_HEADER_REQUIRED, hdr)) break; diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.h b/drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.h index 4c288cdd0e18..eeecef2caac6 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.h +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.h @@ -413,11 +413,14 @@ * 4B used for update command and 4B return code * followed by a max of 504B of variable length value */ -#define NFP_NET_CFG_MBOX_CMD 0x1800 -#define NFP_NET_CFG_MBOX_RET 0x1804 -#define NFP_NET_CFG_MBOX_VAL 0x1808 +#define NFP_NET_CFG_MBOX_BASE 0x1800 #define NFP_NET_CFG_MBOX_VAL_MAX_SZ0x1F8 +#define NFP_NET_CFG_MBOX_SIMPLE_CMD0x0 +#define 
NFP_NET_CFG_MBOX_SIMPLE_RET0x4 +#define NFP_NET_CFG_MBOX_SIMPLE_VAL0x8 +#define NFP_NET_CFG_MBOX_SIMPLE_LEN0x12 + #define NFP_NET_CFG_MBOX_CMD_CTAG_FILTER_ADD 1 #define NFP_NET_CFG_MBOX_CMD_CTAG_FILTER_KILL 2 @@ -428,
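The mailbox-relocation logic of this patch is small enough to model directly. The sketch below mirrors the three cases from the diff: legacy firmware (no MBOX TLV) keeps the fixed 0x1800 mailbox, a non-empty MBOX TLV moves the mailbox to wherever its value bytes live, and a zero-length MBOX TLV means "no mailbox". Constants are taken from the patch; the function names are simplified stand-ins.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CFG_MBOX_BASE        0x1800   /* NFP_NET_CFG_MBOX_BASE */
#define CFG_MBOX_VAL_MAX_SZ  0x1F8    /* NFP_NET_CFG_MBOX_VAL_MAX_SZ */
#define CFG_MBOX_SIMPLE_LEN  0x12     /* NFP_NET_CFG_MBOX_SIMPLE_LEN */

struct tlv_caps {
    uint32_t mbox_off;
    uint32_t mbox_len;
};

/* Defaults for firmware that provides no TLV area: the historical
 * fixed mailbox location and size. */
static void caps_reset(struct tlv_caps *caps)
{
    caps->mbox_off = CFG_MBOX_BASE;
    caps->mbox_len = CFG_MBOX_VAL_MAX_SZ;
}

/* An MBOX TLV relocates the mailbox to its own value bytes; an empty
 * one disables the mailbox entirely. */
static void caps_parse_mbox(struct tlv_caps *caps, const uint8_t *ctrl_mem,
                            const uint8_t *data, uint32_t length)
{
    if (!length) {
        caps->mbox_off = 0;
        caps->mbox_len = 0;
    } else {
        caps->mbox_off = (uint32_t)(data - ctrl_mem);
        caps->mbox_len = length;
    }
}

/* A usable mailbox must at least fit the simple cmd/ret layout. */
static int has_mbox(const struct tlv_caps *caps)
{
    return caps->mbox_len >= CFG_MBOX_SIMPLE_LEN;
}
```

This is why `nfp_net_reconfig_mbox()` in the patch now checks for mailbox presence before writing `mbox + NFP_NET_CFG_MBOX_SIMPLE_CMD`: a TLV-equipped firmware may legitimately ship without one.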
[PATCH net-next 02/12] nfp: release global resources only on the remove path
NFP app is currently shut down as soon as all the vNICs are gone. This means we can't depend on the app existing throughout the lifetime of the device. Free the app only from PCI remove path. Signed-off-by: Jakub Kicinski Reviewed-by: Dirk van der Merwe --- drivers/net/ethernet/netronome/nfp/nfp_net_main.c | 36 --- 1 file changed, 13 insertions(+), 23 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_main.c b/drivers/net/ethernet/netronome/nfp/nfp_net_main.c index c505014121c4..e4f4aa5c298e 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_main.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_main.c @@ -570,17 +570,6 @@ static int nfp_net_pci_map_mem(struct nfp_pf *pf) return err; } -static void nfp_net_pci_remove_finish(struct nfp_pf *pf) -{ - nfp_net_pf_app_stop(pf); - /* stop app first, to avoid double free of ctrl vNIC's ddir */ - nfp_net_debugfs_dir_clean(&pf->ddir); - - nfp_net_pf_free_irqs(pf); - nfp_net_pf_app_clean(pf); - nfp_net_pci_unmap_mem(pf); -} - static int nfp_net_eth_port_update(struct nfp_cpp *cpp, struct nfp_port *port, struct nfp_eth_table *eth_table) @@ -655,9 +644,6 @@ int nfp_net_refresh_port_table_sync(struct nfp_pf *pf) nfp_net_pf_free_vnic(pf, nn); } - if (list_empty(&pf->vnics)) - nfp_net_pci_remove_finish(pf); - return 0; } @@ -810,20 +796,24 @@ int nfp_net_pci_probe(struct nfp_pf *pf) void nfp_net_pci_remove(struct nfp_pf *pf) { - struct nfp_net *nn; + struct nfp_net *nn, *next; mutex_lock(&pf->lock); - if (list_empty(&pf->vnics)) - goto out; + list_for_each_entry_safe(nn, next, &pf->vnics, vnic_list) { + if (!nfp_net_is_data_vnic(nn)) + continue; + nfp_net_pf_clean_vnic(pf, nn); + nfp_net_pf_free_vnic(pf, nn); + } - list_for_each_entry(nn, &pf->vnics, vnic_list) - if (nfp_net_is_data_vnic(nn)) - nfp_net_pf_clean_vnic(pf, nn); + nfp_net_pf_app_stop(pf); + /* stop app first, to avoid double free of ctrl vNIC's ddir */ + nfp_net_debugfs_dir_clean(&pf->ddir); - nfp_net_pf_free_vnics(pf); + 
nfp_net_pf_free_irqs(pf); + nfp_net_pf_app_clean(pf); + nfp_net_pci_unmap_mem(pf); - nfp_net_pci_remove_finish(pf); -out: mutex_unlock(&pf->lock); cancel_work_sync(&pf->port_refresh_work); -- 2.15.1

[PATCH net-next 11/12] nfp: allow apps to disable ctrl vNIC capabilities
Most vNIC capabilities are netdev related. It makes no sense to initialize them and waste FW resources. Some are even counter-productive, like IRQ moderation, which will slow down exchange of control messages. Add to nfp_app a mask of enabled control vNIC capabilities for apps to use. Make flower and BPF enable all capabilities for now. No functional changes. Signed-off-by: Jakub Kicinski Reviewed-by: Dirk van der Merwe --- drivers/net/ethernet/netronome/nfp/bpf/main.c | 2 ++ drivers/net/ethernet/netronome/nfp/flower/main.c| 2 ++ drivers/net/ethernet/netronome/nfp/nfp_app.h| 4 drivers/net/ethernet/netronome/nfp/nfp_net_common.c | 4 4 files changed, 12 insertions(+) diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.c b/drivers/net/ethernet/netronome/nfp/bpf/main.c index 8823c8360047..5f021d0c88a4 100644 --- a/drivers/net/ethernet/netronome/nfp/bpf/main.c +++ b/drivers/net/ethernet/netronome/nfp/bpf/main.c @@ -389,6 +389,8 @@ const struct nfp_app_type app_bpf = { .id = NFP_APP_BPF_NIC, .name = "ebpf", + .ctrl_cap_mask = ~0U, + .init = nfp_bpf_init, .clean = nfp_bpf_clean, diff --git a/drivers/net/ethernet/netronome/nfp/flower/main.c b/drivers/net/ethernet/netronome/nfp/flower/main.c index 3c05b637..742d6f1575b5 100644 --- a/drivers/net/ethernet/netronome/nfp/flower/main.c +++ b/drivers/net/ethernet/netronome/nfp/flower/main.c @@ -565,6 +565,8 @@ static void nfp_flower_stop(struct nfp_app *app) const struct nfp_app_type app_flower = { .id = NFP_APP_FLOWER_NIC, .name = "flower", + + .ctrl_cap_mask = ~0U, .ctrl_has_meta = true, .extra_cap = nfp_flower_extra_cap, diff --git a/drivers/net/ethernet/netronome/nfp/nfp_app.h b/drivers/net/ethernet/netronome/nfp/nfp_app.h index 91d469a8e3e6..7e474df90598 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_app.h +++ b/drivers/net/ethernet/netronome/nfp/nfp_app.h @@ -66,6 +66,9 @@ extern const struct nfp_app_type app_flower; * struct nfp_app_type - application definition * @id:application ID * @name: application name + * 
@ctrl_cap_mask: ctrl vNIC capability mask, allows disabling features like + * IRQMOD which are on by default but counter-productive for + * control messages which are often latency-sensitive * @ctrl_has_meta: control messages have prepend of type:5/port:CTRL * * Callbacks @@ -100,6 +103,7 @@ struct nfp_app_type { enum nfp_app_id id; const char *name; + u32 ctrl_cap_mask; bool ctrl_has_meta; int (*init)(struct nfp_app *app); diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c index 86a88770181d..cdf52421eaca 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c @@ -3790,6 +3790,10 @@ static int nfp_net_read_caps(struct nfp_net *nn) nn->dp.rx_offset = NFP_NET_RX_OFFSET; } + /* For control vNICs mask out the capabilities app doesn't want. */ + if (!nn->dp.netdev) + nn->cap &= nn->app->type->ctrl_cap_mask; + return 0; } -- 2.15.1
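The capability-masking hunk at the end of this patch is a one-line policy decision worth seeing in isolation. The sketch below reproduces its shape: only vNICs without a netdev (control vNICs) get their capabilities intersected with the app's mask. The capability bit values here are illustrative, not the real `NFP_NET_CFG_CTRL_*` constants.

```c
#include <assert.h>
#include <stdint.h>

#define CAP_IRQMOD  (1u << 0)   /* illustrative bit assignments */
#define CAP_RSS     (1u << 1)
#define CAP_CTAG    (1u << 2)

struct app_type {
    uint32_t ctrl_cap_mask;     /* caps a ctrl vNIC is allowed to keep */
};

struct nn {
    uint32_t cap;
    int has_netdev;             /* data vNICs have a netdev, ctrl vNICs don't */
    const struct app_type *type;
};

/* Mirrors the hunk in nfp_net_read_caps(): data vNICs keep everything
 * the BAR advertises; control vNICs are filtered by the app's mask. */
static void read_caps(struct nn *nn, uint32_t bar_cap)
{
    nn->cap = bar_cap;
    if (!nn->has_netdev)
        nn->cap &= nn->type->ctrl_cap_mask;
}
```

With `ctrl_cap_mask = ~0U` (flower and BPF in this patch) the mask is a no-op, which is why the commit message can say "no functional changes"; patch 12 then flips BPF to `0`.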
[PATCH net-next 10/12] nfp: split reading capabilities out of nfp_net_init()
nfp_net_init() is a little long and we are about to add more code to reading capabilities. Move the capability reading, parsing and validation out. Only actual initialization will stay in nfp_net_init(). No functional changes.

Signed-off-by: Jakub Kicinski
Reviewed-by: Dirk van der Merwe
---
 .../net/ethernet/netronome/nfp/nfp_net_common.c | 31 ++
 1 file changed, 20 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
index 15c2fec4f520..86a88770181d 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
@@ -3756,18 +3756,8 @@ static void nfp_net_netdev_init(struct nfp_net *nn)
 	nfp_net_set_ethtool_ops(netdev);
 }
 
-/**
- * nfp_net_init() - Initialise/finalise the nfp_net structure
- * @nn:		NFP Net device structure
- *
- * Return: 0 on success or negative errno on error.
- */
-int nfp_net_init(struct nfp_net *nn)
+static int nfp_net_read_caps(struct nfp_net *nn)
 {
-	int err;
-
-	nn->dp.rx_dma_dir = DMA_FROM_DEVICE;
-
 	/* Get some of the read-only fields from the BAR */
 	nn->cap = nn_readl(nn, NFP_NET_CFG_CAP);
 	nn->max_mtu = nn_readl(nn, NFP_NET_CFG_MAX_MTU);
@@ -3800,6 +3790,25 @@ int nfp_net_init(struct nfp_net *nn)
 		nn->dp.rx_offset = NFP_NET_RX_OFFSET;
 	}
 
+	return 0;
+}
+
+/**
+ * nfp_net_init() - Initialise/finalise the nfp_net structure
+ * @nn:		NFP Net device structure
+ *
+ * Return: 0 on success or negative errno on error.
+ */
+int nfp_net_init(struct nfp_net *nn)
+{
+	int err;
+
+	nn->dp.rx_dma_dir = DMA_FROM_DEVICE;
+
+	err = nfp_net_read_caps(nn);
+	if (err)
+		return err;
+
 	/* Set default MTU and Freelist buffer size */
 	if (nn->max_mtu < NFP_NET_DEFAULT_MTU)
 		nn->dp.mtu = nn->max_mtu;
-- 
2.15.1
[PATCH net-next 08/12] nfp: read ME frequency from vNIC ctrl memory
PCIe island clock frequency is used when converting coalescing parameters from usecs to NFP timestamps. Most chips don't run at 1200MHz, allow FW to provide us with the real frequency. Signed-off-by: Jakub Kicinski Reviewed-by: Dirk van der Merwe --- drivers/net/ethernet/netronome/nfp/nfp_net_common.c | 2 +- drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.c | 11 +++ drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.h | 7 +++ drivers/net/ethernet/netronome/nfp/nfp_net_main.c | 6 -- drivers/net/ethernet/netronome/nfp/nfp_netvf_main.c | 6 -- 5 files changed, 19 insertions(+), 13 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c index b47da7ff01dd..4218a8660d46 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c @@ -2458,7 +2458,7 @@ void nfp_net_coalesce_write_cfg(struct nfp_net *nn) * ME timestamp ticks. There are 16 ME clock cycles for each timestamp * count. 
*/ - factor = nn->me_freq_mhz / 16; + factor = nn->tlv_caps.me_freq_mhz / 16; /* copy RX interrupt coalesce parameters */ value = (nn->rx_coalesce_max_frames << 16) | diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.c b/drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.c index ff155242a665..72da1b352418 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.c @@ -42,6 +42,7 @@ static void nfp_net_tlv_caps_reset(struct nfp_net_tlv_caps *caps) { memset(caps, 0, sizeof(*caps)); + caps->me_freq_mhz = 1200; } int nfp_net_tlv_caps_parse(struct device *dev, u8 __iomem *ctrl_mem, @@ -91,6 +92,16 @@ int nfp_net_tlv_caps_parse(struct device *dev, u8 __iomem *ctrl_mem, dev_err(dev, "END TLV should be empty, has len:%d\n", length); return -EINVAL; + case NFP_NET_CFG_TLV_TYPE_ME_FREQ: + if (length != 4) { + dev_err(dev, + "ME FREQ TLV should be 4B, is %dB\n", + length); + return -EINVAL; + } + + caps->me_freq_mhz = readl(data); + break; default: if (!FIELD_GET(NFP_NET_CFG_TLV_HEADER_REQUIRED, hdr)) break; diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.h b/drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.h index cb058050e079..4c288cdd0e18 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.h +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.h @@ -474,17 +474,24 @@ * %NFP_NET_CFG_TLV_TYPE_END: * Empty, end of TLV list. Must be the last TLV. Drivers will stop processing * further TLVs when encountered. + * + * %NFP_NET_CFG_TLV_TYPE_ME_FREQ: + * Single word, ME frequency in MHz as used in calculation for + * %NFP_NET_CFG_RXR_IRQ_MOD and %NFP_NET_CFG_TXR_IRQ_MOD. 
*/ #define NFP_NET_CFG_TLV_TYPE_UNKNOWN 0 #define NFP_NET_CFG_TLV_TYPE_RESERVED 1 #define NFP_NET_CFG_TLV_TYPE_END 2 +#define NFP_NET_CFG_TLV_TYPE_ME_FREQ 3 struct device; /** * struct nfp_net_tlv_caps - parsed control BAR TLV capabilities + * @me_freq_mhz: ME clock_freq (MHz) */ struct nfp_net_tlv_caps { + u32 me_freq_mhz; }; int nfp_net_tlv_caps_parse(struct device *dev, u8 __iomem *ctrl_mem, diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_main.c b/drivers/net/ethernet/netronome/nfp/nfp_net_main.c index fd9554002fca..15fa47f622aa 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_main.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_main.c @@ -208,12 +208,6 @@ nfp_net_pf_init_vnic(struct nfp_pf *pf, struct nfp_net *nn, unsigned int id) { int err; - /* Get ME clock frequency from ctrl BAR -* XXX for now frequency is hardcoded until we figure out how -* to get the value from nfp-hwinfo into ctrl bar -*/ - nn->me_freq_mhz = 1200; - err = nfp_net_init(nn); if (err) return err; diff --git a/drivers/net/ethernet/netronome/nfp/nfp_netvf_main.c b/drivers/net/ethernet/netronome/nfp/nfp_netvf_main.c index c879626e035b..b802a1d55449 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_netvf_main.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_netvf_main.c @@ -277,12 +277,6 @@ static int nfp_netvf_pci_probe(struct pci_dev *pdev, } nfp_net_irqs_assign(nn, vf->irq_entries, num_irqs); - /* Get ME clock frequency from ctrl BAR -* XXX for now frequency is hardcoded until we figure out how -* to get the value from nfp-hwinfo into ctrl bar -*/ - nn->me_
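The frequency-to-ticks conversion this patch parameterizes is simple arithmetic, shown here as a standalone sketch. Per the comment in `nfp_net_coalesce_write_cfg()`, one ME timestamp tick is 16 ME clock cycles, and 1200 MHz is the fallback the driver assumes when firmware provides no ME_FREQ TLV. The helper name is invented for the sketch.

```c
#include <assert.h>
#include <stdint.h>

#define ME_FREQ_FALLBACK_MHZ 1200   /* default when no ME_FREQ TLV is present */

/* Convert a coalescing delay in usecs to ME timestamp ticks:
 * ticks = usecs * (me_freq_mhz / 16), since one tick is 16 cycles
 * and me_freq_mhz cycles elapse per usec. */
static uint32_t usecs_to_me_ticks(uint32_t me_freq_mhz, uint32_t usecs)
{
    uint32_t factor = me_freq_mhz / 16;

    return factor * usecs;
}
```

This is why hardcoding 1200 MHz mattered: on a chip actually running at, say, 800 MHz, the programmed delay would be 50% longer than the usecs the user requested.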
[PATCH net-next 01/12] nfp: core: make scalar CPP helpers fail on short accesses
Currently the helpers for accessing 4 or 8 byte values over the CPP bus return the length of IO on success. If the IO was short caller has to deal with error handling. The short IO for 4/8B values is completely impractical. Make the helpers return an error if full access was not possible. Fix the few places which are actually dealing with errors correctly, most call sites already only deal with negative return codes. Signed-off-by: Jakub Kicinski Reviewed-by: Dirk van der Merwe --- .../net/ethernet/netronome/nfp/nfp_net_debugdump.c | 19 +- .../ethernet/netronome/nfp/nfpcore/nfp_cppcore.c | 42 +- .../ethernet/netronome/nfp/nfpcore/nfp_cpplib.c| 38 .../net/ethernet/netronome/nfp/nfpcore/nfp_rtsym.c | 4 --- 4 files changed, 58 insertions(+), 45 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_debugdump.c b/drivers/net/ethernet/netronome/nfp/nfp_net_debugdump.c index 173646e17e94..e6f19f44b461 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_debugdump.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_debugdump.c @@ -518,16 +518,15 @@ nfp_dump_csr_range(struct nfp_pf *pf, struct nfp_dumpspec_csr *spec_csr, max_rd_addr = cpp_rd_addr + be32_to_cpu(spec_csr->cpp.dump_length); while (cpp_rd_addr < max_rd_addr) { - if (is_xpb_read(&spec_csr->cpp.cpp_id)) - bytes_read = nfp_xpb_readl(pf->cpp, cpp_rd_addr, - (u32 *)dest); - else + if (is_xpb_read(&spec_csr->cpp.cpp_id)) { + err = nfp_xpb_readl(pf->cpp, cpp_rd_addr, (u32 *)dest); + } else { bytes_read = nfp_cpp_read(pf->cpp, cpp_id, cpp_rd_addr, dest, reg_sz); - if (bytes_read != reg_sz) { - if (bytes_read >= 0) - bytes_read = -EIO; - dump_header->error = cpu_to_be32(bytes_read); + err = bytes_read == reg_sz ? 
0 : -EIO; + } + if (err) { + dump_header->error = cpu_to_be32(err); dump_header->error_offset = cpu_to_be32(cpp_rd_addr); break; } @@ -555,8 +554,8 @@ nfp_read_indirect_csr(struct nfp_cpp *cpp, NFP_IND_ME_REFL_WR_SIG_INIT, cpp_params.token, cpp_params.island); result = nfp_cpp_writel(cpp, cpp_id, csr_ctx_ptr_offs, context); - if (result != sizeof(context)) - return result < 0 ? result : -EIO; + if (result) + return result; cpp_id = nfp_get_numeric_cpp_id(&cpp_params); result = nfp_cpp_read(cpp, cpp_id, csr_ctx_ptr_offs, dest, reg_sz); diff --git a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_cppcore.c b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_cppcore.c index 28262470dabf..ef30597aa319 100644 --- a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_cppcore.c +++ b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_cppcore.c @@ -674,18 +674,20 @@ void __iomem *nfp_cpp_area_iomem(struct nfp_cpp_area *area) * @offset:Offset into area * @value: Pointer to read buffer * - * Return: length of the io, or -ERRNO + * Return: 0 on success, or -ERRNO */ int nfp_cpp_area_readl(struct nfp_cpp_area *area, unsigned long offset, u32 *value) { u8 tmp[4]; - int err; + int n; - err = nfp_cpp_area_read(area, offset, &tmp, sizeof(tmp)); - *value = get_unaligned_le32(tmp); + n = nfp_cpp_area_read(area, offset, &tmp, sizeof(tmp)); + if (n != sizeof(tmp)) + return n < 0 ? n : -EIO; - return err; + *value = get_unaligned_le32(tmp); + return 0; } /** @@ -694,16 +696,18 @@ int nfp_cpp_area_readl(struct nfp_cpp_area *area, * @offset:Offset into area * @value: Value to write * - * Return: length of the io, or -ERRNO + * Return: 0 on success, or -ERRNO */ int nfp_cpp_area_writel(struct nfp_cpp_area *area, unsigned long offset, u32 value) { u8 tmp[4]; + int n; put_unaligned_le32(value, tmp); + n = nfp_cpp_area_write(area, offset, &tmp, sizeof(tmp)); - return nfp_cpp_area_write(area, offset, &tmp, sizeof(tmp)); + return n == sizeof(tmp) ? 0 : n < 0 ? 
n : -EIO; } /** @@ -712,18 +716,20 @@ int nfp_cpp_area_writel(struct nfp_cpp_area *area, * @offset:Offset into area * @value: Pointer to read buffer * - * Return: length of the io, or -ERRNO + * Return: 0 on success, or -ERRNO */ int nfp_cpp_area_readq(struct nfp_cpp_area *area, unsigned long offset, u64 *value) { u8 tmp[8]; - int err; + int n; - err = nfp_cpp_area_read(area, offset
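The error-normalization idiom this patch applies everywhere — a fixed-size accessor returns 0 on full success and a negative errno otherwise, folding short IO into `-EIO` — can be demonstrated with a mocked transport. This is a userspace sketch: `fake_cpp_read` stands in for `nfp_cpp_area_read`, and the native-order `memcpy` stands in for `get_unaligned_le32`.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Toy transport: copies up to `avail` bytes, or fails outright when
 * avail is the (size_t)-1 sentinel.  Returns bytes copied or -errno,
 * like the real nfp_cpp_area_read(). */
static int fake_cpp_read(const void *src, size_t avail, void *dst, size_t len)
{
    if (avail == (size_t)-1)
        return -EIO;             /* hard transport error */
    if (avail < len)
        len = avail;             /* short read */
    memcpy(dst, src, len);
    return (int)len;
}

/* The pattern from the patch: a 4B accessor either fully succeeds
 * (return 0) or fails (negative errno).  Short IO becomes -EIO here,
 * so no caller has to compare the return value against a length. */
static int readl_checked(const void *src, size_t avail, unsigned int *value)
{
    unsigned char tmp[4];
    int n;

    n = fake_cpp_read(src, avail, tmp, sizeof(tmp));
    if (n != (int)sizeof(tmp))
        return n < 0 ? n : -EIO;

    memcpy(value, tmp, sizeof(tmp));   /* native-order stand-in */
    return 0;
}
```

The payoff is visible in the `nfp_read_indirect_csr()` hunk above: `if (result != sizeof(context)) return result < 0 ? result : -EIO;` collapses to `if (result) return result;`.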
[PATCH net-next 03/12] nfp: register devlink after app is created
Devlink used to have two global locks: devlink lock and port lock, our lock ordering looked like this: devlink lock -> driver's pf->lock -> devlink port lock After recent changes port lock was replaced with per-instance lock. Unfortunately, new per-instance lock is taken on most operations now. This means we can only grab the pf->lock from the port split/unsplit ops. Lock ordering looks like this: devlink lock -> driver's pf->lock -> devlink instance lock Since we can't take pf->lock from most devlink ops, make sure nfp_apps are prepared to service them as soon as devlink is registered. Locking the pf must be pushed down after nfp_app_init() callback. The init order looks like this: nfp_app_init devlink_register nfp_app_start netdev/port_register As soon as app_init is done nfp_apps must be ready to service devlink-related callbacks. apps can only register their own devlink objects from nfp_app_start. Fixes: 2406e7e546b2 ("devlink: Add per devlink instance lock") Signed-off-by: Jakub Kicinski Reviewed-by: Dirk van der Merwe --- drivers/net/ethernet/netronome/nfp/nfp_devlink.c | 12 +--- drivers/net/ethernet/netronome/nfp/nfp_main.c | 17 ++- drivers/net/ethernet/netronome/nfp/nfp_net_main.c | 36 --- 3 files changed, 29 insertions(+), 36 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_devlink.c b/drivers/net/ethernet/netronome/nfp/nfp_devlink.c index 6c9f29c2e975..eb0fc614673d 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_devlink.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_devlink.c @@ -152,18 +152,8 @@ nfp_devlink_port_unsplit(struct devlink *devlink, unsigned int port_index) static int nfp_devlink_eswitch_mode_get(struct devlink *devlink, u16 *mode) { struct nfp_pf *pf = devlink_priv(devlink); - int ret; - - mutex_lock(&pf->lock); - if (!pf->app) { - ret = -EBUSY; - goto out; - } - ret = nfp_app_eswitch_mode_get(pf->app, mode); -out: - mutex_unlock(&pf->lock); - return ret; + return nfp_app_eswitch_mode_get(pf->app, mode); } const struct 
devlink_ops nfp_devlink_ops = { diff --git a/drivers/net/ethernet/netronome/nfp/nfp_main.c b/drivers/net/ethernet/netronome/nfp/nfp_main.c index 0953fa8f3109..c5b91040b12e 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_main.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_main.c @@ -499,13 +499,9 @@ static int nfp_pci_probe(struct pci_dev *pdev, if (err) goto err_hwinfo_free; - err = devlink_register(devlink, &pdev->dev); - if (err) - goto err_hwinfo_free; - err = nfp_nsp_init(pdev, pf); if (err) - goto err_devlink_unreg; + goto err_hwinfo_free; pf->mip = nfp_mip_open(pf->cpp); pf->rtbl = __nfp_rtsym_table_read(pf->cpp, pf->mip); @@ -549,8 +545,6 @@ static int nfp_pci_probe(struct pci_dev *pdev, kfree(pf->eth_tbl); kfree(pf->nspi); vfree(pf->dumpspec); -err_devlink_unreg: - devlink_unregister(devlink); err_hwinfo_free: kfree(pf->hwinfo); nfp_cpp_free(pf->cpp); @@ -571,18 +565,13 @@ static int nfp_pci_probe(struct pci_dev *pdev, static void nfp_pci_remove(struct pci_dev *pdev) { struct nfp_pf *pf = pci_get_drvdata(pdev); - struct devlink *devlink; nfp_hwmon_unregister(pf); - devlink = priv_to_devlink(pf); - - nfp_net_pci_remove(pf); - nfp_pcie_sriov_disable(pdev); pci_sriov_set_totalvfs(pf->pdev, 0); - devlink_unregister(devlink); + nfp_net_pci_remove(pf); vfree(pf->dumpspec); kfree(pf->rtbl); @@ -598,7 +587,7 @@ static void nfp_pci_remove(struct pci_dev *pdev) kfree(pf->eth_tbl); kfree(pf->nspi); mutex_destroy(&pf->lock); - devlink_free(devlink); + devlink_free(priv_to_devlink(pf)); pci_release_regions(pdev); pci_disable_device(pdev); } diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_main.c b/drivers/net/ethernet/netronome/nfp/nfp_net_main.c index e4f4aa5c298e..fd9554002fca 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_main.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_main.c @@ -373,7 +373,9 @@ nfp_net_pf_app_init(struct nfp_pf *pf, u8 __iomem *qc_bar, unsigned int stride) if (IS_ERR(pf->app)) return PTR_ERR(pf->app); + 
mutex_lock(&pf->lock); err = nfp_app_init(pf->app); + mutex_unlock(&pf->lock); if (err) goto err_free; @@ -401,7 +403,9 @@ nfp_net_pf_app_init(struct nfp_pf *pf, u8 __iomem *qc_bar, unsigned int stride) err_unmap: nfp_cpp_area_release_free(pf->ctrl_vnic_bar); err_app_clean: + mutex_lock(&pf->lock); nfp_app_clean(pf->app); + mutex_unlock(&pf->lock); err_free: nfp_app_free(pf->app); pf->app = NULL; @@ -414,7 +418,11 @@ static void nfp_n
[PATCH net-next 12/12] nfp: bpf: disable all ctrl vNIC capabilities
BPF firmware currently exposes the IRQ moderation capability. The driver will make use of it by default, inserting a 50 usec delay into every control message exchange. This cuts the number of messages per second we can exchange almost in half.

None of the other capabilities make much sense for the BPF control vNIC, either. Disable them all.

Signed-off-by: Jakub Kicinski
Reviewed-by: Dirk van der Merwe
---
 drivers/net/ethernet/netronome/nfp/bpf/main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.c b/drivers/net/ethernet/netronome/nfp/bpf/main.c
index 5f021d0c88a4..4ee11bf2aed7 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/main.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/main.c
@@ -389,7 +389,7 @@ const struct nfp_app_type app_bpf = {
 	.id		= NFP_APP_BPF_NIC,
 	.name		= "ebpf",
 
-	.ctrl_cap_mask	= ~0U,
+	.ctrl_cap_mask	= 0,
 
 	.init		= nfp_bpf_init,
 	.clean		= nfp_bpf_clean,
-- 
2.15.1
[PATCH net-next 00/12] nfp: devlink, capabilities extensions and updates
Hi!

This series starts with an improvement to the usability of the device memory accessors (CPP transactions).

The next few patches are devoted to fixing the devlink locking: after the recent patches for mlxsw, the locking scheme of devlink ops has to be reworked. The following patches improve the NFP code dealing with "representors", and expand the error message printed when the driver has no support for the loaded FW.

The second part of the series is focused on vNIC capabilities read from vNIC control memory (often referred to as "BAR0" for historical reasons). A TLV capability format is established and immediately made use of. The next patches rework the parsing of features for the control vNIC, which allows apps to mask out features they don't want enabled.

Jakub Kicinski (12):
  nfp: core: make scalar CPP helpers fail on short accesses
  nfp: release global resources only on the remove path
  nfp: register devlink after app is created
  nfp: add nfp_reprs_get_locked() helper
  nfp: protect each repr pointer individually with RCU
  nfp: improve app not found message
  nfp: add TLV capabilities to the BAR
  nfp: read ME frequency from vNIC ctrl memory
  nfp: read mailbox address from TLV caps
  nfp: split reading capabilities out of nfp_net_init()
  nfp: allow apps to disable ctrl vNIC capabilities
  nfp: bpf: disable all ctrl vNIC capabilities

 drivers/net/ethernet/netronome/nfp/Makefile        |   1 +
 drivers/net/ethernet/netronome/nfp/bpf/main.c      |   2 +
 drivers/net/ethernet/netronome/nfp/flower/main.c   |  47 ---
 drivers/net/ethernet/netronome/nfp/nfp_app.c       |  14 ++-
 drivers/net/ethernet/netronome/nfp/nfp_app.h       |   6 +
 drivers/net/ethernet/netronome/nfp/nfp_devlink.c   |  12 +-
 drivers/net/ethernet/netronome/nfp/nfp_main.c      |  17 +--
 drivers/net/ethernet/netronome/nfp/nfp_net.h       |   3 +
 .../net/ethernet/netronome/nfp/nfp_net_common.c    |  62 +++---
 drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.c  | 135 +
 drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.h  |  99 +--
 .../net/ethernet/netronome/nfp/nfp_net_debugdump.c |  19 ++-
 drivers/net/ethernet/netronome/nfp/nfp_net_main.c  |  76 ++--
 drivers/net/ethernet/netronome/nfp/nfp_net_repr.c  |  73 +--
 drivers/net/ethernet/netronome/nfp/nfp_net_repr.h  |  15 ++-
 .../net/ethernet/netronome/nfp/nfp_netvf_main.c    |   6 -
 .../ethernet/netronome/nfp/nfpcore/nfp_cppcore.c   |  42 ---
 .../ethernet/netronome/nfp/nfpcore/nfp_cpplib.c    |  38 +++---
 .../net/ethernet/netronome/nfp/nfpcore/nfp_rtsym.c |   4 -
 19 files changed, 463 insertions(+), 208 deletions(-)
 create mode 100644 drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.c
-- 
2.15.1
Re: dangers of bots on the mailing lists was Re: divide error in ___bpf_prog_run
On Wed, 2018-01-17 at 17:46 -0800, Eric Biggers wrote:
> On Wed, Jan 17, 2018 at 05:18:17PM -0800, Joe Perches wrote:
> > On Wed, 2018-01-17 at 20:09 -0500, Theodore Ts'o wrote:
> > > get_maintainer.pl, which is often not accurate
> > 
> > Examples please.
> 
> Well, the primary problem is that the place where the crash occurs is not
> necessarily responsible for the bug. But syzbot actually does have a file
> blacklist for exactly that reason; see
> https://github.com/google/syzkaller/blob/master/pkg/report/linux.go#L56

Which has no association to a problem with get_maintainer.

> And yes, get_maintainer.pl sometimes isn't accurate even if the offending
> code is correctly identified. That's more of a community problem, e.g.
> people sometimes don't bother to remove themselves from MAINTAINERS when
> they quit maintaining, and sometimes people don't feel responsible enough
> for a file to add themselves to MAINTAINERS, even when in practice they
> are actually taking most of the patches to it through their tree.

Yup, not a get_maintainer problem.

There are more than 1800 sections and more than 1200 individual names in the MAINTAINERS file. In practice, only a few dozen maintainers are upstream patch paths.
RE: [PATCH] net: fec: add necessary defines to work on ARM64
From: Lucas Stach Sent: Thursday, January 18, 2018 2:31 AM >The i.MX8 is a ARMv8 based SoC, that uses the same FEC IP as the earlier, >ARMv7 based, i.MX SoCs. Allow the driver to work on ARM64. > >Signed-off-by: Lucas Stach >--- > drivers/net/ethernet/freescale/fec.h | 5 +++-- > drivers/net/ethernet/freescale/fec_main.c | 8 +--- > 2 files changed, 8 insertions(+), 5 deletions(-) > >diff --git a/drivers/net/ethernet/freescale/fec.h >b/drivers/net/ethernet/freescale/fec.h >index 5385074b3b7d..e7381f8ef89d 100644 >--- a/drivers/net/ethernet/freescale/fec.h >+++ b/drivers/net/ethernet/freescale/fec.h >@@ -20,7 +20,8 @@ > #include > > #if defined(CONFIG_M523x) || defined(CONFIG_M527x) || >defined(CONFIG_M528x) || \ >-defined(CONFIG_M520x) || defined(CONFIG_M532x) || >defined(CONFIG_ARM) >+defined(CONFIG_M520x) || defined(CONFIG_M532x) || >defined(CONFIG_ARM) || \ >+defined(CONFIG_ARM64) > /* > *Just figures, Motorola would have to change the offsets for > *registers in the same peripheral device on different models >@@ -195,7 +196,7 @@ > *Evidently, ARM SoCs have the FEC block generated in a > *little endian mode so adjust endianness accordingly. > */ >-#if defined(CONFIG_ARM) >+#if defined(CONFIG_ARM) || defined(CONFIG_ARM64) > #define fec32_to_cpu le32_to_cpu > #define fec16_to_cpu le16_to_cpu > #define cpu_to_fec32 cpu_to_le32 >diff --git a/drivers/net/ethernet/freescale/fec_main.c >b/drivers/net/ethernet/freescale/fec_main.c >index a74300a4459c..45eed37373a9 100644 >--- a/drivers/net/ethernet/freescale/fec_main.c >+++ b/drivers/net/ethernet/freescale/fec_main.c >@@ -195,7 +195,8 @@ MODULE_PARM_DESC(macaddr, "FEC Ethernet MAC >address"); > * account when setting it. 
> */ > #if defined(CONFIG_M523x) || defined(CONFIG_M527x) || >defined(CONFIG_M528x) || \ >-defined(CONFIG_M520x) || defined(CONFIG_M532x) || >defined(CONFIG_ARM) >+defined(CONFIG_M520x) || defined(CONFIG_M532x) || >defined(CONFIG_ARM) || \ >+defined(CONFIG_ARM64) > #define OPT_FRAME_SIZE (PKT_MAXBUF_SIZE << 16) > #else > #define OPT_FRAME_SIZE 0 >@@ -2107,7 +2108,8 @@ static int fec_enet_get_regs_len(struct net_device >*ndev) > > /* List of registers that can be safety be read to dump them with ethtool */ >#if defined(CONFIG_M523x) || defined(CONFIG_M527x) || >defined(CONFIG_M528x) || \ >- defined(CONFIG_M520x) || defined(CONFIG_M532x) || >defined(CONFIG_ARM) >+ defined(CONFIG_M520x) || defined(CONFIG_M532x) || >defined(CONFIG_ARM) || \ >+ defined(CONFIG_ARM64) > static u32 fec_enet_register_offset[] = { > FEC_IEVENT, FEC_IMASK, FEC_R_DES_ACTIVE_0, >FEC_X_DES_ACTIVE_0, > FEC_ECNTRL, FEC_MII_DATA, FEC_MII_SPEED, FEC_MIB_CTRLSTAT, >FEC_R_CNTRL, @@ -3119,7 +3121,7 @@ static int fec_enet_init(struct >net_device *ndev) > unsigned dsize_log2 = __fls(dsize); > > WARN_ON(dsize != (1 << dsize_log2)); >-#if defined(CONFIG_ARM) >+#if defined(CONFIG_ARM) || defined(CONFIG_ARM64) > fep->rx_align = 0xf; > fep->tx_align = 0xf; > #else >-- >2.11.0 Kconfig also need to add ARM64 support.
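The Kconfig change the reviewer is asking for would look roughly like the fragment below. This is only a sketch: the exact symbol list for FEC in drivers/net/ethernet/freescale/Kconfig differs from what is shown here, and only the addition of ARM64 to the dependency expression is the point being illustrated.

```kconfig
config FEC
	tristate "FEC ethernet controller (of ColdFire and some i.MX CPUs)"
	depends on (M523x || M527x || M5272 || M528x || M520x || M532x || \
		   ARCH_MXC || SOC_IMX28 || ARM64)
```

Without this, the driver changes compile on ARM64 but the option remains unselectable there.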
Re: dangers of bots on the mailing lists was Re: divide error in ___bpf_prog_run
On Wed, Jan 17, 2018 at 05:18:17PM -0800, Joe Perches wrote:
> On Wed, 2018-01-17 at 20:09 -0500, Theodore Ts'o wrote:
> > get_maintainer.pl, which is often not accurate
> 
> Examples please.

Well, the primary problem is that the place where the crash occurs is not necessarily responsible for the bug. But syzbot actually does have a file blacklist for exactly that reason; see https://github.com/google/syzkaller/blob/master/pkg/report/linux.go#L56

It definitely needs further improvement (and anyone is welcome to contribute), though it will never be perfect.

There is also a KASAN change by Dmitry queued up for 4.16 that will allow KASAN to detect invalid frees. That would have detected the bug in crypto/pcrypt.c that was causing corruption in the kmalloc-1024 slab cache, which in turn was causing crashes in all sorts of random kernel code, resulting in many bug reports. So detecting bugs early, before they corrupt all sorts of random kernel data structures, helps a lot too.

And yes, get_maintainer.pl sometimes isn't accurate even if the offending code is correctly identified. That's more of a community problem: people sometimes don't bother to remove themselves from MAINTAINERS when they quit maintaining, and sometimes people don't feel responsible enough for a file to add themselves to MAINTAINERS, even when in practice they are actually taking most of the patches to it through their tree.

Eric
[PATCH net-next] net: hns: Fix for variable may be used uninitialized warnings
When !CONFIG_REGMAP hns throws compiler warnings since dsaf_read_syscon ignores the return result from regmap_read, which allows val to be uninitialized. Fixes: 86897c960b49 ("net: hns: add syscon operation for dsaf") Reported-by: Jason Gunthorpe Signed-off-by: Huazhong Tan Signed-off-by: Yunsheng Lin --- drivers/net/ethernet/hisilicon/hns/hns_dsaf_misc.c | 52 +++--- drivers/net/ethernet/hisilicon/hns/hns_dsaf_reg.h | 7 +-- 2 files changed, 38 insertions(+), 21 deletions(-) diff --git a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_misc.c b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_misc.c index ca247c2..acf2963 100644 --- a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_misc.c +++ b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_misc.c @@ -44,12 +44,17 @@ static void dsaf_write_sub(struct dsaf_device *dsaf_dev, u32 reg, u32 val) static u32 dsaf_read_sub(struct dsaf_device *dsaf_dev, u32 reg) { - u32 ret; - - if (dsaf_dev->sub_ctrl) - ret = dsaf_read_syscon(dsaf_dev->sub_ctrl, reg); - else + u32 ret = 0; + int err; + + if (dsaf_dev->sub_ctrl) { + err = dsaf_read_syscon(dsaf_dev->sub_ctrl, reg, &ret); + if (err) + dev_err(dsaf_dev->dev, "dsaf_read_syscon error %d!\n", + err); + } else { ret = dsaf_read_reg(dsaf_dev->sc_base, reg); + } return ret; } @@ -188,18 +193,23 @@ static void cpld_led_reset_acpi(struct hns_mac_cb *mac_cb) static int cpld_set_led_id(struct hns_mac_cb *mac_cb, enum hnae_led_state status) { + u32 val = 0; + int ret; + if (!mac_cb->cpld_ctrl) return 0; switch (status) { case HNAE_LED_ACTIVE: - mac_cb->cpld_led_value = - dsaf_read_syscon(mac_cb->cpld_ctrl, -mac_cb->cpld_ctrl_reg); - dsaf_set_bit(mac_cb->cpld_led_value, DSAF_LED_ANCHOR_B, -CPLD_LED_ON_VALUE); + ret = dsaf_read_syscon(mac_cb->cpld_ctrl, mac_cb->cpld_ctrl_reg, + &val); + if (ret) + return ret; + + dsaf_set_bit(val, DSAF_LED_ANCHOR_B, CPLD_LED_ON_VALUE); dsaf_write_syscon(mac_cb->cpld_ctrl, mac_cb->cpld_ctrl_reg, - mac_cb->cpld_led_value); + val); + mac_cb->cpld_led_value = val; break; case 
HNAE_LED_INACTIVE: dsaf_set_bit(mac_cb->cpld_led_value, DSAF_LED_ANCHOR_B, @@ -560,12 +570,19 @@ static phy_interface_t hns_mac_get_phy_if_acpi(struct hns_mac_cb *mac_cb) int hns_mac_get_sfp_prsnt(struct hns_mac_cb *mac_cb, int *sfp_prsnt) { + u32 val = 0; + int ret; + if (!mac_cb->cpld_ctrl) return -ENODEV; - *sfp_prsnt = !dsaf_read_syscon(mac_cb->cpld_ctrl, mac_cb->cpld_ctrl_reg - + MAC_SFP_PORT_OFFSET); + ret = dsaf_read_syscon(mac_cb->cpld_ctrl, + mac_cb->cpld_ctrl_reg + MAC_SFP_PORT_OFFSET, + &val); + if (ret) + return ret; + *sfp_prsnt = !val; return 0; } @@ -615,7 +632,7 @@ static int hns_mac_config_sds_loopback(struct hns_mac_cb *mac_cb, bool en) #define RX_CSR(lane, reg) ((0x4080 + (reg) * 0x0002 + (lane) * 0x0200) * 2) u64 reg_offset = RX_CSR(lane_id[mac_cb->mac_id], 0); - int sfp_prsnt; + int sfp_prsnt = 0; int ret = hns_mac_get_sfp_prsnt(mac_cb, &sfp_prsnt); if (!mac_cb->phy_dev) { @@ -627,7 +644,7 @@ static int hns_mac_config_sds_loopback(struct hns_mac_cb *mac_cb, bool en) } if (mac_cb->serdes_ctrl) { - u32 origin; + u32 origin = 0; if (!AE_IS_VER1(mac_cb->dsaf_dev->dsaf_ver)) { #define HILINK_ACCESS_SEL_CFG 0x40008 @@ -644,7 +661,10 @@ static int hns_mac_config_sds_loopback(struct hns_mac_cb *mac_cb, bool en) HILINK_ACCESS_SEL_CFG, 3); } - origin = dsaf_read_syscon(mac_cb->serdes_ctrl, reg_offset); + ret = dsaf_read_syscon(mac_cb->serdes_ctrl, reg_offset, + &origin); + if (ret) + return ret; dsaf_set_field(origin, 1ull << 10, 10, en); dsaf_write_syscon(mac_cb->serdes_ctrl, reg_offset, origin); diff --git a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_reg.h b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_reg.h index 46a52d9..886cbbf 100644 --- a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_reg.h +++ b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_reg.h @@ -1034,12 +1034,9 @@ static inline void dsaf_write_syscon(s
Re: [PATCH] net/mlx5e: Fix trailing semicolon
On 01/17/2018 04:09 AM, Luis de Bethencourt wrote: > The trailing semicolon is an empty statement that does no operation. > Removing it since it doesn't do anything. > > Signed-off-by: Luis de Bethencourt Looks Good, Thanks Luis. Reviewed-by: Saeed Mahameed
Re: [PATCHv2 net-next 8/8] net: sched: cls_u32: add extack support
On Wed, 17 Jan 2018 17:40:27 -0500, Alexander Aring wrote: > diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c > index 8840baa1b9b4..daeac7282387 100644 > --- a/net/sched/cls_u32.c > +++ b/net/sched/cls_u32.c > @@ -524,8 +524,10 @@ static int u32_replace_hw_hnode(struct tcf_proto *tp, > struct tc_u_hnode *h, > offloaded = true; > } > > - if (skip_sw && !offloaded) > + if (skip_sw && !offloaded) { > + NL_SET_ERR_MSG(extack, "Failed to offload filter requested with > skip sw"); > return -EINVAL; > + } > > return 0; > } Why did you ignore my comment about using NL_SET_ERR_MSG_MOD? Do you disagree with it?
Re: [PATCHv2 net-next 7/8] net: sched: cls: add extack support for tc_setup_cb_call
On Wed, 17 Jan 2018 17:40:26 -0500, Alexander Aring wrote: > diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c > index 9f88107c29c5..e864ad523800 100644 > --- a/net/sched/cls_api.c > +++ b/net/sched/cls_api.c > @@ -1566,21 +1566,26 @@ static int tc_exts_setup_cb_egdev_call(struct > tcf_exts *exts, > } > > int tc_setup_cb_call(struct tcf_block *block, struct tcf_exts *exts, > - enum tc_setup_type type, void *type_data, bool err_stop) > + enum tc_setup_type type, void *type_data, bool err_stop, > + struct netlink_ext_ack *extack) > { > int ok_count; > int ret; > > ret = tcf_block_cb_call(block, type, type_data, err_stop); > - if (ret < 0) > + if (ret < 0) { > + NL_SET_ERR_MSG(extack, "Failed to initialize tcf block"); This has nothing to do with block init. > return ret; > + } > ok_count = ret; > > if (!exts) > return ok_count; > ret = tc_exts_setup_cb_egdev_call(exts, type, type_data, err_stop); > - if (ret < 0) > + if (ret < 0) { > + NL_SET_ERR_MSG(extack, "Failed to initialize tcf block > extensions"); Ditto, plus this is about redirections to other devices (hence eg[ress ]dev). exts part is an internal detail. > return ret; > + } > ok_count += ret; > > return ok_count; These messages are completely off-target, and this is not a right place to handle any of this. Quentin's patches do the correct thing for offload, please drop this patch from the series and let us handle the offload cases. Thank you.
linux-next: build warning after merge of the net-next tree
Hi all,

After merging the net-next tree, today's linux-next build (powerpc ppc64_defconfig) produced this warning:

net/sched/cls_api.c: In function 'tc_dump_tfilter':
net/sched/cls_api.c:1362:8: warning: 'parent' may be used uninitialized in this function [-Wmaybe-uninitialized]
   if (!tcf_chain_dump(chain, q, parent, skb, cb,
       ^
                       index_start, &index))

Probably introduced by commit 7960d1daf278 ("net: sched: use block index as a handle instead of qdisc when block is shared")

-- 
Cheers,
Stephen Rothwell
Re: dangers of bots on the mailing lists was Re: divide error in ___bpf_prog_run
On Wed, 2018-01-17 at 20:09 -0500, Theodore Ts'o wrote: > get_maintainer.pl, which is often not accurate Examples please.
pull-request: bpf 2018-01-18
Hi David,

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Fix a divide by zero due to a wrong if (src_reg == 0) check in 64-bit mode. Properly handle this in the interpreter, and also mask it generically in the verifier to guard against similar checks in JITs, from Eric and Alexei.

2) Fix a bug in the arm64 JIT when tail calls are involved and progs have different stack sizes, from Daniel.

3) Reject stores into BPF context that are not the expected BPF_STX | BPF_MEM variant, from Daniel.

4) Mark dst reg as unknown on {s,u}bounds adjustments when the src reg has derived bounds from dead branches, from Daniel.

Please consider pulling these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

Thanks a lot!

The following changes since commit 8155aedf512edd3f88ef19f7cacf476ace7d1322:

  Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf (2018-01-14 11:01:33 -0500)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

for you to fetch changes up to 6f16101e6a8b4324c36e58a29d9e0dbb287cdedb:

  bpf: mark dst unknown on inconsistent {s, u}bounds adjustments (2018-01-17 16:23:17 -0800)

Alexei Starovoitov (1):
      bpf: fix 32-bit divide by zero

Daniel Borkmann (3):
      bpf, arm64: fix stack_depth tracking in combination with tail calls
      bpf: reject stores into ctx via st and xadd
      bpf: mark dst unknown on inconsistent {s, u}bounds adjustments

Eric Dumazet (1):
      bpf: fix divides by zero

 arch/arm64/net/bpf_jit_comp.c               |  20 ++--
 kernel/bpf/core.c                           |   4 +-
 kernel/bpf/verifier.c                       |  64 ++--
 net/core/filter.c                           |   4 +
 tools/testing/selftests/bpf/test_verifier.c | 152 +++-
 5 files changed, 219 insertions(+), 25 deletions(-)
[PATCH net-next] macvlan: Pass SIOC[SG]HWTSTAMP ioctls and get_ts_info to lower device
This patch allows to enable hardware timestamping on macvlan intefaces and find out capabilities of the lower device. Signed-off-by: Grzegorz Halat --- drivers/net/macvlan.c | 43 +++ 1 file changed, 43 insertions(+) diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c index a0f2be8..314e878 100644 --- a/drivers/net/macvlan.c +++ b/drivers/net/macvlan.c @@ -23,6 +23,7 @@ #include #include #include +#include #include #include #include @@ -942,6 +943,30 @@ static void macvlan_dev_get_stats64(struct net_device *dev, } } +static int macvlan_do_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd) +{ + struct macvlan_dev *vlan = netdev_priv(dev); + const struct net_device_ops *ops = vlan->lowerdev->netdev_ops; + struct ifreq ifrl; + int err = -EOPNOTSUPP; + + strncpy(ifrl.ifr_name, vlan->lowerdev->name, IFNAMSIZ); + ifrl.ifr_ifru = ifr->ifr_ifru; + + switch (cmd) { + case SIOCGHWTSTAMP: + case SIOCSHWTSTAMP: + if (ops->ndo_do_ioctl) + err = ops->ndo_do_ioctl(vlan->lowerdev, &ifrl, cmd); + break; + } + + if (!err) + ifr->ifr_ifru = ifrl.ifr_ifru; + + return err; +} + static int macvlan_vlan_rx_add_vid(struct net_device *dev, __be16 proto, u16 vid) { @@ -1022,6 +1047,22 @@ static int macvlan_ethtool_get_link_ksettings(struct net_device *dev, return __ethtool_get_link_ksettings(vlan->lowerdev, cmd); } +static int macvlan_ethtool_get_ts_info(struct net_device *dev, + struct ethtool_ts_info *ts_info) +{ + const struct macvlan_dev *vlan = netdev_priv(dev); + const struct ethtool_ops *eth_ops = vlan->lowerdev->ethtool_ops; + + if (eth_ops->get_ts_info) + return eth_ops->get_ts_info(vlan->lowerdev, ts_info); + + ts_info->so_timestamping = SOF_TIMESTAMPING_RX_SOFTWARE | + SOF_TIMESTAMPING_SOFTWARE; + ts_info->phc_index = -1; + + return 0; +} + static netdev_features_t macvlan_fix_features(struct net_device *dev, netdev_features_t features) { @@ -1096,6 +1137,7 @@ static const struct ethtool_ops macvlan_ethtool_ops = { .get_link = ethtool_op_get_link, .get_link_ksettings 
= macvlan_ethtool_get_link_ksettings, .get_drvinfo= macvlan_ethtool_get_drvinfo, + .get_ts_info= macvlan_ethtool_get_ts_info, }; static const struct net_device_ops macvlan_netdev_ops = { @@ -,6 +1153,7 @@ static const struct net_device_ops macvlan_netdev_ops = { .ndo_set_rx_mode= macvlan_set_mac_lists, .ndo_get_stats64= macvlan_dev_get_stats64, .ndo_validate_addr = eth_validate_addr, + .ndo_do_ioctl = macvlan_do_ioctl, .ndo_vlan_rx_add_vid= macvlan_vlan_rx_add_vid, .ndo_vlan_rx_kill_vid = macvlan_vlan_rx_kill_vid, .ndo_fdb_add= macvlan_fdb_add, -- 2.7.4
Re: dangers of bots on the mailing lists was Re: divide error in ___bpf_prog_run
On Wed, Jan 17, 2018 at 04:21:13PM -0800, Alexei Starovoitov wrote:
> 
> If syzkaller can only test one tree then linux-next should be the one.

Well, there's been some controversy about that. The problem is that it's often not clear whether a report is a long-standing bug or a bug in a particular subsystem tree --- and if so, *which* subsystem tree, etc. So it gets blasted to linux-kernel and to the output of get_maintainer.pl, which is often not accurate --- since the location of the crash doesn't necessarily point out where the problem originated, and hence who should look at the syzbot report. And so this has caused some irritation.

> There is some value in testing stable trees, but any developer
> will first ask for a reproducer on the latest, so the usefulness of
> reporting such bugs will be limited.

What I suggested was to test Linus's tree, and then, when a problem is found and syzkaller has a reliable repro, to *then* try to see if it *also* shows up in the LTS kernels.

- Ted
linux-next: manual merge of the net-next tree with the net tree
Hi all,

Today's linux-next merge of the net-next tree got a conflict in:

  drivers/net/tun.c

between commit:

  4df0bfc79904 ("tun: fix a memory leak for tfile->tx_array")

from the net tree and commits:

  8bf5c4ee1889 ("tun: setup xdp_rxq_info")
  5990a30510ed ("tun/tap: use ptr_ring instead of skb_array")
  fc72d1d54dd9 ("tuntap: XDP transmission")

from the net-next tree.

I have no idea how to clean this up, so I effectively reverted the net tree commit.

I fixed it up (see above) and can carry the fix as necessary. This is now fixed as far as linux-next is concerned, but any non trivial conflicts should be mentioned to your upstream maintainer when your tree is submitted for merging. You may also want to consider cooperating with the maintainer of the conflicting tree to minimise any particularly complex conflicts.

-- 
Cheers,
Stephen Rothwell
[PATCH bpf-next 0/2] bpf: increase bpf test coverage
The BPF verifier has 700+ tests used to check its correctness. Beyond checking the verifier log, tell the kernel to run accepted programs as well via the bpf_prog_test_run() command. That improves the quality of the tests and increases bpf test coverage.

Alexei Starovoitov (2):
  bpf: allow socket_filter programs to use bpf_prog_test_run
  selftests/bpf: make test_verifier run most programs

 kernel/bpf/syscall.c                        |  2 ++
 net/core/filter.c                           |  1 +
 tools/testing/selftests/bpf/test_verifier.c | 50 -
 3 files changed, 52 insertions(+), 1 deletion(-)

-- 
2.9.5
[PATCH bpf-next 1/2] bpf: allow socket_filter programs to use bpf_prog_test_run
in order to improve test coverage allow socket_filter program type to be run via bpf_prog_test_run command. Since such programs can be loaded by non-root tighten permissions for bpf_prog_test_run to be root only to avoid surprises. Signed-off-by: Alexei Starovoitov --- kernel/bpf/syscall.c | 2 ++ net/core/filter.c| 1 + 2 files changed, 3 insertions(+) diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index c691b9e972e3..08ca42b1b916 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -1504,6 +1504,8 @@ static int bpf_prog_test_run(const union bpf_attr *attr, struct bpf_prog *prog; int ret = -ENOTSUPP; + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; if (CHECK_ATTR(BPF_PROG_TEST_RUN)) return -EINVAL; diff --git a/net/core/filter.c b/net/core/filter.c index db2ee8c7e1bd..30fafaaa90fa 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -4526,6 +4526,7 @@ const struct bpf_verifier_ops sk_filter_verifier_ops = { }; const struct bpf_prog_ops sk_filter_prog_ops = { + .test_run = bpf_prog_test_run_skb, }; const struct bpf_verifier_ops tc_cls_act_verifier_ops = { -- 2.9.5
[PATCH bpf-next 2/2] selftests/bpf: make test_verifier run most programs
to improve test coverage make test_verifier run all successfully loaded programs on 64-byte zero initialized data. For clsbpf and xdp it means empty 64-byte packet. For lwt and socket_filters it's 64-byte packet where skb->data points after L2. Signed-off-by: Alexei Starovoitov --- tools/testing/selftests/bpf/test_verifier.c | 50 - 1 file changed, 49 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c index 543847957fdd..87da8d1bf2dd 100644 --- a/tools/testing/selftests/bpf/test_verifier.c +++ b/tools/testing/selftests/bpf/test_verifier.c @@ -29,6 +29,7 @@ #include #include #include +#include #include @@ -49,6 +50,8 @@ #define MAX_INSNS 512 #define MAX_FIXUPS 8 #define MAX_NR_MAPS4 +#define POINTER_VALUE 0xcafe4all +#define TEST_DATA_LEN 64 #define F_NEEDS_EFFICIENT_UNALIGNED_ACCESS (1 << 0) #define F_LOAD_WITH_STRICT_ALIGNMENT (1 << 1) @@ -62,6 +65,7 @@ struct bpf_test { int fixup_map_in_map[MAX_FIXUPS]; const char *errstr; const char *errstr_unpriv; + uint32_t retval; enum { UNDEF, ACCEPT, @@ -95,6 +99,7 @@ static struct bpf_test tests[] = { BPF_EXIT_INSN(), }, .result = ACCEPT, + .retval = -3, }, { "unreachable", @@ -210,6 +215,7 @@ static struct bpf_test tests[] = { BPF_EXIT_INSN(), }, .result = ACCEPT, + .retval = 1, }, { "test8 ld_imm64", @@ -477,6 +483,7 @@ static struct bpf_test tests[] = { .errstr_unpriv = "R0 leaks addr", .result = ACCEPT, .result_unpriv = REJECT, + .retval = POINTER_VALUE, }, { "check valid spill/fill, skb mark", @@ -763,6 +770,7 @@ static struct bpf_test tests[] = { .errstr_unpriv = "R1 pointer comparison", .result_unpriv = REJECT, .result = ACCEPT, + .retval = -ENOENT, }, { "jump test 4", @@ -1783,6 +1791,7 @@ static struct bpf_test tests[] = { BPF_EXIT_INSN(), }, .result = ACCEPT, + .retval = 0xfaceb00c, }, { "PTR_TO_STACK store/load - bad alignment on off", @@ -1841,6 +1850,7 @@ static struct bpf_test tests[] = { .result = ACCEPT, .result_unpriv = REJECT, 
.errstr_unpriv = "R0 leaks addr", + .retval = POINTER_VALUE, }, { "unpriv: add const to pointer", @@ -2014,6 +2024,7 @@ static struct bpf_test tests[] = { BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_6, 0), BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_get_hash_recalc), + BPF_MOV64_IMM(BPF_REG_0, 0), BPF_EXIT_INSN(), }, .result = ACCEPT, @@ -2778,6 +2789,7 @@ static struct bpf_test tests[] = { }, .result = ACCEPT, .prog_type = BPF_PROG_TYPE_SCHED_CLS, + .retval = 1, }, { "direct packet access: test12 (and, good access)", @@ -2802,6 +2814,7 @@ static struct bpf_test tests[] = { }, .result = ACCEPT, .prog_type = BPF_PROG_TYPE_SCHED_CLS, + .retval = 1, }, { "direct packet access: test13 (branches, good access)", @@ -2832,6 +2845,7 @@ static struct bpf_test tests[] = { }, .result = ACCEPT, .prog_type = BPF_PROG_TYPE_SCHED_CLS, + .retval = 1, }, { "direct packet access: test14 (pkt_ptr += 0, CONST_IMM, good access)", @@ -2855,6 +2869,7 @@ static struct bpf_test tests[] = { }, .result = ACCEPT, .prog_type = BPF_PROG_TYPE_SCHED_CLS, + .retval = 1, }, { "direct packet access: test15 (spill with xadd)", @@ -3141,6 +3156,7 @@ static struct bpf_test tests[] = { }, .result = ACCEPT, .prog_type = BPF_PROG_TYPE_SCHED_CLS, + .retval = 1, }, { "direct packet access: test28 (marking on <=, bad access)", @@ -5758,6 +5774,7 @@ static struct bpf_test tests[] = { }, .result = ACCEPT, .prog_type = BPF_PROG_TYPE_SCHED_CLS, + .retval = 0 /* csum_diff of 64-byte packet */, }, { "helper access to variable memory: size = 0 not allow
Re: [bpf-next PATCH] samples/bpf: xdp2skb_meta comment explain why pkt-data pointers are invalidated
On 01/17/2018 11:17 AM, Jesper Dangaard Brouer wrote:
> Improve the 'unknown reason' comment with an actual explanation of why
> the ctx pkt-data pointers need to be loaded after the helper function
> bpf_xdp_adjust_meta(). Based on the explanation Daniel gave.
> 
> Fixes: 36e04a2d78d9 ("samples/bpf: xdp2skb_meta shows transferring info from XDP to SKB")
> Reported-by: Daniel Borkmann 
> Signed-off-by: Jesper Dangaard Brouer 

Applied to bpf-next, thanks for following up on this, Jesper!
Re: [PATCH bpf-next 0/3] bpf: add dumping and disassembler for non-host JITs
On 01/17/2018 01:05 AM, Jakub Kicinski wrote:
> Hi,
> 
> Currently bpftool can disassemble a host JITed image, for example x86_64,
> using libbfd. However, it cannot disassemble an offload JITed image.
> 
> There are two reasons:
> 
> 1. bpf_obj_get_info_by_fd/struct bpf_prog_info cannot get the address
>    of the JITed image and the image's length.
> 
> 2. Even after issue 1 is resolved, bpftool cannot figure out the
>    offload arch from bpf_prog_info, and therefore can't drive the libbfd
>    disassembler correctly.
> 
> This patch set resolves issue 1 by introducing two new fields, "jited_len"
> and "jited_image", in bpf_dev_offload. These two fields serve as the generic
> interface for all offload targets to communicate the JITed image address and
> length to the higher level caller. For example, bpf_obj_get_info_by_fd can
> use them to fill the userspace-visible fields jited_prog_len and
> jited_prog_insns.
> 
> This patch set resolves issue 2 by getting the bfd backend name through
> "ifindex", i.e. the network interface index.
> 
> v1:
>  - Deduce the bfd arch name through ifindex, i.e. the network interface
>    index. First, map ifindex to devname through ifindex_to_name_ns, then
>    get the pci id through /sys/class/dev/DEVNAME/device/vendor.
>    (Daniel, Alexei)

Looks good, series applied to bpf-next, thanks guys!
Re: [PATCH bpf v2] bpf: mark dst unknown on inconsistent {s,u}bounds adjustments
On Thu, Jan 18, 2018 at 01:15:21AM +0100, Daniel Borkmann wrote: > syzkaller generated a BPF proglet and triggered a warning with > the following: > > 0: (b7) r0 = 0 > 1: (d5) if r0 s<= 0x0 goto pc+0 >R0=inv0 R1=ctx(id=0,off=0,imm=0) R10=fp0 > 2: (1f) r0 -= r1 >R0=inv0 R1=ctx(id=0,off=0,imm=0) R10=fp0 > verifier internal error: known but bad sbounds > > What happens is that in the first insn, r0's min/max value > are both 0 due to the immediate assignment, later in the jsle > test the bounds are updated for the min value in the false > path, meaning, they yield smin_val = 1, smax_val = 0, and when > ctx pointer is subtracted from r0, verifier bails out with the > internal error and throwing a WARN since smin_val != smax_val > for the known constant. > > For min_val > max_val scenario it means that reg_set_min_max() > and reg_set_min_max_inv() (which both refine existing bounds) > demonstrated that such branch cannot be taken at runtime. > > In above scenario for the case where it will be taken, the > existing [0, 0] bounds are kept intact. Meaning, the rejection > is not due to a verifier internal error, and therefore the > WARN() is not necessary either. > > We could just reject such cases in adjust_{ptr,scalar}_min_max_vals() > when either known scalars have smin_val != smax_val or > umin_val != umax_val or any scalar reg with bounds > smin_val > smax_val or umin_val > umax_val. However, there > may be a small risk of breakage of buggy programs, so handle > this more gracefully and in adjust_{ptr,scalar}_min_max_vals() > just taint the dst reg as unknown scalar when we see ops with > such kind of src reg. > > Reported-by: syzbot+6d362cadd45dc0a12...@syzkaller.appspotmail.com > Signed-off-by: Daniel Borkmann there are several ways to fix it and this approach looks to be the safest way to do it since we're so late into the release. Applied, Thank you Daniel.
Re: [PATCH RFC net-next 6/6] rds: zerocopy Tx support.
On Wed, Jan 17, 2018 at 7:20 AM, Sowmini Varadhan wrote: > If the MSG_ZEROCOPY flag is specified with rds_sendmsg(), and, > if the SO_ZEROCOPY socket option has been set on the PF_RDS socket, > application pages sent down with rds_sendmsg() are pinned. > > The pinning uses the accounting infrastructure added by > Commit a91dbff551a6 ("sock: ulimit on MSG_ZEROCOPY pages") > > The payload bytes in the message may not be modified for the > duration that the message has been pinned. A multi-threaded > application using this infrastructure may thus need to be notified > about send-completion so that it can free/reuse the buffers > passed to rds_sendmsg(). Notification of send-completion will > identify each message-buffer by a cookie that the application > must specify as ancillary data to rds_sendmsg(). > The ancillary data in this case has cmsg_level == SOL_RDS > and cmsg_type == RDS_CMSG_ZCOPY_COOKIE. > > Signed-off-by: Sowmini Varadhan > --- > include/uapi/linux/rds.h |1 + > net/rds/message.c| 44 +++- > net/rds/rds.h|3 ++- > net/rds/send.c | 27 --- > 4 files changed, 70 insertions(+), 5 deletions(-) > +static int rds_cmsg_zcopy(struct rds_sock *rs, struct rds_message *rm, > + struct cmsghdr *cmsg) > +{ > + unsigned int *cookie; Use fixed-width types across the ABI. > + > + if (cmsg->cmsg_len < CMSG_LEN(sizeof(*cookie))) > + return -EINVAL; > + cookie = CMSG_DATA(cmsg); > + rm->data.op_mmp_znotifier->z_cookie = *cookie; > + return 0; > +} > + > @@ -1107,12 +1126,14 @@ int rds_sendmsg(struct socket *sock, struct msghdr > *msg, size_t payload_len) > > /* Attach data to the rm */ > if (payload_len) { > - rm->data.op_sg = rds_message_alloc_sgs(rm, ceil(payload_len, > PAGE_SIZE)); > + int num_sgs = ceil(payload_len, PAGE_SIZE); > + > + rm->data.op_sg = rds_message_alloc_sgs(rm, num_sgs); Unrelated change. Leftover from revising the patch? Also, with user pages, data is no longer guaranteed to be page aligned, so this may need more sgs.
Re: [PATCH net-next] net: core: Expose number of link up/down transitions
On 1/17/18 3:06 PM, Florian Fainelli wrote: > diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c > index 7bf8b85ade16..9f732c3dc2ce 100644 > --- a/net/core/net-sysfs.c > +++ b/net/core/net-sysfs.c > @@ -295,10 +295,29 @@ static ssize_t carrier_changes_show(struct device *dev, > struct net_device *netdev = to_net_dev(dev); > > return sprintf(buf, fmt_dec, > -atomic_read(&netdev->carrier_changes)); > +atomic_read(&netdev->count_link_up) + > +atomic_read(&netdev->count_link_down)); > } > static DEVICE_ATTR_RO(carrier_changes); > > +static ssize_t count_link_up_show(struct device *dev, > + struct device_attribute *attr, char *buf) > +{ > + struct net_device *netdev = to_net_dev(dev); > + > + return sprintf(buf, fmt_dec, atomic_read(&netdev->count_link_up)); > +} > +static DEVICE_ATTR_RO(count_link_up); > + > +static ssize_t count_link_down_show(struct device *dev, > + struct device_attribute *attr, char *buf) > +{ > + struct net_device *netdev = to_net_dev(dev); > + > + return sprintf(buf, fmt_dec, atomic_read(&netdev->count_link_down)); > +} > +static DEVICE_ATTR_RO(count_link_down); > + > /* read-write attributes */ > > static int change_mtu(struct net_device *dev, unsigned long new_mtu) > @@ -547,6 +566,8 @@ static struct attribute *net_class_attrs[] > __ro_after_init = { > &dev_attr_phys_port_name.attr, > &dev_attr_phys_switch_id.attr, > &dev_attr_proto_down.attr, > + &dev_attr_count_link_up.attr, > + &dev_attr_count_link_down.attr, > NULL, > }; > ATTRIBUTE_GROUPS(net_class); Personally, I do not like adding any more sysfs files. As discussed in the past sysfs is a huge contributor to the overhead of netdevs.
Re: [RFC v2 net-next 02/10] net: ipv4: raw: Hook into time based transmission.
On Wed, 2018-01-17 at 15:06 -0800, Jesus Sanchez-Palencia wrote: > From: Richard Cochran > > For raw packets, copy the desired future transmit time from the CMSG > cookie into the skb. > > Signed-off-by: Richard Cochran > Signed-off-by: Jesus Sanchez-Palencia > --- > net/ipv4/raw.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c > index 136544b36a46..e37ea8ab6a78 100644 > --- a/net/ipv4/raw.c > +++ b/net/ipv4/raw.c > @@ -381,6 +381,7 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 > *fl4, > > skb->priority = sk->sk_priority; > skb->mark = sk->sk_mark; > + skb->tstamp = sockc->transmit_time; > skb_dst_set(skb, &rt->dst); > *rtp = NULL; > > @@ -562,6 +563,7 @@ static int raw_sendmsg(struct sock *sk, struct msghdr > *msg, size_t len) > } > > ipc.sockc.tsflags = sk->sk_tsflags; > + ipc.sockc.transmit_time = 0; > ipc.addr = inet->inet_saddr; > ipc.opt = NULL; > ipc.tx_flags = 0; It seems that skb_scrub_packet() will clear skb->tstamp, meaning that going through a tunnel will break your feature. Maybe we need to remove skb->tstamp clear from skb_scrub_packet() and do the cleaning only in forwarding path.
Re: [PATCH RFC net-next 5/6] rds: support for zcopy completion notification
On Wed, Jan 17, 2018 at 7:20 AM, Sowmini Varadhan wrote: > RDS removes a datagram from the retransmit queue when an ACK is > received. The ACK indicates that the receiver has queued the > RDS datagram, so that the sender can safely forget the datagram. > > If the datagram to be removed had pinned pages set up, add > an entry to the rs->rs_znotify_queue so that the notification > will be sent up via rds_rm_zerocopy_callback() when the > rds_message is eventually freed by rds_message_purge. > > Signed-off-by: Sowmini Varadhan > --- > +static void rds_rm_zerocopy_callback(struct rds_sock *rs) > +{ > + struct sock *sk = rds_rs_to_sk(rs); > + struct sk_buff *skb; > + struct sock_exterr_skb *serr; > + struct sk_buff_head *q; > + unsigned long flags; > + struct sk_buff *tail; > + u32 *ptr; > + int ncookies = 0, i; > + struct rds_znotifier *znotif, *ztmp; > + LIST_HEAD(tmp_list); > + > + spin_lock_irqsave(&rs->rs_lock, flags); > + list_splice(&rs->rs_znotify_queue, &tmp_list); > + INIT_LIST_HEAD(&rs->rs_znotify_queue); > + spin_unlock_irqrestore(&rs->rs_lock, flags); > + > + list_for_each_entry_safe(znotif, ztmp, &tmp_list, z_list) > + ncookies++; > + if (ncookies == 0) > + return; > + skb = alloc_skb(ncookies * sizeof(u32), GFP_ATOMIC); > + if (!skb) { > + spin_lock_irqsave(&rs->rs_lock, flags); > + list_splice(&tmp_list, &rs->rs_znotify_queue); > + spin_unlock_irqrestore(&rs->rs_lock, flags); > + return; > + } TCP zerocopy avoids this issue by allocating the notification skb when the zerocopy packet is created, which here would be rds_message_copy_from_user. This does not add an allocation if using the same trick of stashing the intermediate notification object (op_mmp_znotifier) in skb->cb. Though alloc_skb is probably more expensive than that kzalloc, if nothing else because of the extra zeroing.
Please add a new SO_EE_CODE flag, even if the semantics can be derived from the packet family (for now). Even better would be if we can avoid the cookies completely. I understand the issue with concurrent send threads racing on obtaining a zckey value. If the sender could learn which zckey was chosen for a call, would that suffice? I suspect that even with concurrent senders, notifications arrive largely in order, in which case we could just maintain the existing semantics and even reuse that implementation. > + serr->ee.ee_info = 0; > + serr->ee.ee_code |= SO_EE_CODE_ZEROCOPY_COPIED; Why set the copied bit here? > + ptr = skb_put(skb, ncookies * sizeof(u32)); > + > + i = 0; > + list_for_each_entry_safe(znotif, ztmp, &tmp_list, z_list) { > + list_del(&znotif->z_list); > + ptr[i++] = znotif->z_cookie; > + mm_unaccount_pinned_pages(&znotif->z_mmp); > + kfree(znotif); > + } > + WARN_ON(!list_empty(&tmp_list)); > + q = &sk->sk_error_queue; > + spin_lock_irqsave(&q->lock, flags); > + tail = skb_peek_tail(q); > + if (!tail || > + SKB_EXT_ERR(tail)->ee.ee_origin != SO_EE_ORIGIN_ZEROCOPY) { > + __skb_queue_tail(q, skb); > + skb = NULL; > + } This drops the packet if the branch is not taken. In the TCP case this condition means that we can try to coalesce packets, but that is not the case here. > + spin_unlock_irqrestore(&q->lock, flags); > + sk->sk_error_report(sk); > + consume_skb(skb); > +} > + > /* > * This relies on dma_map_sg() not touching sg[].page during merging. 
> */ > @@ -66,11 +127,15 @@ static void rds_message_purge(struct rds_message *rm) > for (i = 0; i < rm->data.op_nents; i++) { > rdsdebug("putting data page %p\n", (void > *)sg_page(&rm->data.op_sg[i])); > /* XXX will have to put_page for page refs */ > - __free_page(sg_page(&rm->data.op_sg[i])); > + if (!rm->data.op_zcopy) > + __free_page(sg_page(&rm->data.op_sg[i])); > + else > + put_page(sg_page(&rm->data.op_sg[i])); > } > rm->data.op_nents = 0; > spin_lock_irqsave(&rm->m_rs_lock, flags); > if (rm->m_rs) { > + rds_rm_zerocopy_callback(rm->m_rs); > sock_put(rds_rs_to_sk(rm->m_rs)); > rm->m_rs = NULL; > } > diff --git a/net/rds/rds.h b/net/rds/rds.h > index 374ae83..de5015a 100644 > --- a/net/rds/rds.h > +++ b/net/rds/rds.h > @@ -356,6 +356,12 @@ static inline u32 > rds_rdma_cookie_offset(rds_rdma_cookie_t cookie) > #define RDS_MSG_PAGEVEC7 >
Re: [PATCH net-next] net: core: Expose number of link up/down transitions
On Thu, Jan 18, 2018 at 01:06:52AM +0100, Andrew Lunn wrote: > > What is the idea to have two separate counters? Can a delta between them > > be bigger than 1? > > Yes, it can. > > These counters are incremented in netif_carrier_on() / > netif_carrier_off(). They are not always called in pairs and they can > be called multiple times for the same event. The phylib will call them > when it notices the PHY saying the link is down/up, and the MAC driver > sometimes also calls them. We check the __LINK_STATE_NOCARRIER bit before changing these counters, so if we call netif_carrier_on() twice, the counter will be incremented only by one, will it not? void netif_carrier_on(struct net_device *dev) if (test_and_clear_bit(__LINK_STATE_NOCARRIER, &dev->state)) { atomic_inc(&dev->carrier_changes); ... void netif_carrier_off(struct net_device *dev) if (!test_and_set_bit(__LINK_STATE_NOCARRIER, &dev->state)) { atomic_inc(&dev->carrier_changes); > > Andrew
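Dmitry's reading can be checked with a tiny userspace model of that guard (a hypothetical stand-in, not the kernel implementation): the counter only moves when the NOCARRIER bit actually flips, so back-to-back netif_carrier_on() calls count once.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace model of the __LINK_STATE_NOCARRIER guard quoted above. */
static atomic_bool no_carrier = true;	/* link starts down */
static atomic_int carrier_changes;

static void carrier_on(void)
{
	/* mirrors test_and_clear_bit(__LINK_STATE_NOCARRIER, ...) */
	if (atomic_exchange(&no_carrier, false))
		atomic_fetch_add(&carrier_changes, 1);
}

static void carrier_off(void)
{
	/* mirrors !test_and_set_bit(__LINK_STATE_NOCARRIER, ...) */
	if (!atomic_exchange(&no_carrier, true))
		atomic_fetch_add(&carrier_changes, 1);
}
```

Under this model a duplicate on/on or off/off sequence bumps the counter once, which is why the delta between separate up/down counters still cannot drift from double invocations alone.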
Re: dangers of bots on the mailing lists was Re: divide error in ___bpf_prog_run
On Wed, Jan 17, 2018 at 03:47:35PM -0500, Theodore Ts'o wrote: > On Wed, Jan 17, 2018 at 12:09:18PM +0100, Dmitry Vyukov wrote: > > On Wed, Jan 17, 2018 at 10:49 AM, Daniel Borkmann > > wrote: > > > Don't know if there's such a possibility, but it would be nice if we could > > > target fuzzing for specific subsystems in related subtrees directly (e.g. > > > for bpf in bpf and bpf-next trees as one example). Dmitry? > > > > Hi Daniel, > > > > It's doable. > > Let's start with one bpf tree. Will it be bpf or bpf-next? Which one > > contains more ongoing work? What's the exact git repo address/branch, > > so that I don't second guess? > > As a suggestion, until the bpf subsystem is free from problems that > can be found by Syzkaller in Linus's upstream tree, maybe it's not > worth trying to test individual subsystem trees such as the bpf tree? > After all, there's no point trying to bisect our way checking to see > if the problem is with a newly added commit in a development tree, if > it turns out the problem was first introduced years ago in the 4.1 or > 3.19 timeframe. > > After all, finding these older problems is going to have much higher > value, since these are the sorts of potential security problems that > are worth backporting to real device kernels for Android/ChromeOS, and > for enterprise distro kernels. So from an "impact to the industry" > perspective, focusing on Linus's tree is going to be far more > productive. That's a win for the community, and it's a win for those > people on the Syzkaller team who might be going up for promo or > listing their achievements at performance review time. :-) All correct, but if there is capacity in the syzkaller server farm to test the bpf and bpf-next trees it will be a huge win for everyone as well. For example, in the recent speculation fix we missed an integer overflow case and it was found by syzkaller only when the patches landed in the net tree. 
We did a quick follow-up patch, but it caused double work for us and all the stable maintainers. I think finding bugs in the development trees is just as important as bugs in Linus's tree, since it improves the quality of patches before they reach mainline. If syzkaller can only test one tree, then linux-next should be the one. There is some value in testing stable trees, but any developer will first ask for a reproducer on the latest tree, so the usefulness of reporting such bugs will be limited.
Re: [PATCH net-next] net: core: Expose number of link up/down transitions
On 1/17/18 3:52 PM, Jiri Pirko wrote: > Thu, Jan 18, 2018 at 12:06:57AM CET, f.faine...@gmail.com wrote: >> From: David Decotigny >> >> Expose the number of times the link has been going UP or DOWN, and >> update the "carrier_changes" counter to be the sum of these two events. >> While at it, also update the sysfs-class-net documentation to cover: >> carrier_changes (3.15), count_link_up (4.16) and count_link_down (4.16) >> >> Signed-off-by: David Decotigny > > [...] > > >> @@ -161,6 +161,8 @@ enum { >> IFLA_EVENT, >> IFLA_NEW_NETNSID, >> IFLA_IF_NETNSID, >> +IFLA_COUNT_LINK_UP, >> +IFLA_COUNT_LINK_DOWN, > > > IFLA_LINK_UP_COUNT, > IFLA_LINK_DOWN_COUNT, > > would sound a bit nicer to me. given existing IFLA_LINK_* attributes how about IFLA_CARRIER_{UP,DOWN}_COUNT?
[PATCH bpf v2] bpf: mark dst unknown on inconsistent {s,u}bounds adjustments
syzkaller generated a BPF proglet and triggered a warning with the following: 0: (b7) r0 = 0 1: (d5) if r0 s<= 0x0 goto pc+0 R0=inv0 R1=ctx(id=0,off=0,imm=0) R10=fp0 2: (1f) r0 -= r1 R0=inv0 R1=ctx(id=0,off=0,imm=0) R10=fp0 verifier internal error: known but bad sbounds What happens is that in the first insn, r0's min/max values are both 0 due to the immediate assignment; later, in the jsle test, the bounds are updated for the min value in the false path, meaning they yield smin_val = 1, smax_val = 0, and when the ctx pointer is subtracted from r0, the verifier bails out with the internal error and throws a WARN since smin_val != smax_val for the known constant. For the min_val > max_val scenario it means that reg_set_min_max() and reg_set_min_max_inv() (which both refine existing bounds) demonstrated that such a branch cannot be taken at runtime. In the above scenario, for the case where it will be taken, the existing [0, 0] bounds are kept intact. Meaning, the rejection is not due to a verifier internal error, and therefore the WARN() is not necessary either. We could just reject such cases in adjust_{ptr,scalar}_min_max_vals() when either known scalars have smin_val != smax_val or umin_val != umax_val, or any scalar reg has bounds smin_val > smax_val or umin_val > umax_val. However, there may be a small risk of breaking buggy programs, so handle this more gracefully and in adjust_{ptr,scalar}_min_max_vals() just taint the dst reg as an unknown scalar when we see ops with such a src reg. 
Reported-by: syzbot+6d362cadd45dc0a12...@syzkaller.appspotmail.com Signed-off-by: Daniel Borkmann --- kernel/bpf/verifier.c | 27 +++--- tools/testing/selftests/bpf/test_verifier.c | 123 +++- 2 files changed, 138 insertions(+), 12 deletions(-) diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index eb062b0..13551e6 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -1895,17 +1895,13 @@ static int adjust_ptr_min_max_vals(struct bpf_verifier_env *env, dst_reg = ®s[dst]; - if (WARN_ON_ONCE(known && (smin_val != smax_val))) { - print_verifier_state(env, env->cur_state); - verbose(env, - "verifier internal error: known but bad sbounds\n"); - return -EINVAL; - } - if (WARN_ON_ONCE(known && (umin_val != umax_val))) { - print_verifier_state(env, env->cur_state); - verbose(env, - "verifier internal error: known but bad ubounds\n"); - return -EINVAL; + if ((known && (smin_val != smax_val || umin_val != umax_val)) || + smin_val > smax_val || umin_val > umax_val) { + /* Taint dst register if offset had invalid bounds derived from +* e.g. dead branches. +*/ + __mark_reg_unknown(dst_reg); + return 0; } if (BPF_CLASS(insn->code) != BPF_ALU64) { @@ -2097,6 +2093,15 @@ static int adjust_scalar_min_max_vals(struct bpf_verifier_env *env, src_known = tnum_is_const(src_reg.var_off); dst_known = tnum_is_const(dst_reg->var_off); + if ((src_known && (smin_val != smax_val || umin_val != umax_val)) || + smin_val > smax_val || umin_val > umax_val) { + /* Taint dst register if offset had invalid bounds derived from +* e.g. dead branches. 
+*/ + __mark_reg_unknown(dst_reg); + return 0; + } + if (!src_known && opcode != BPF_ADD && opcode != BPF_SUB && opcode != BPF_AND) { __mark_reg_unknown(dst_reg); diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c index 67e7c41..5ed4175 100644 --- a/tools/testing/selftests/bpf/test_verifier.c +++ b/tools/testing/selftests/bpf/test_verifier.c @@ -6732,7 +6732,7 @@ static struct bpf_test tests[] = { BPF_JMP_IMM(BPF_JA, 0, 0, -7), }, .fixup_map1 = { 4 }, - .errstr = "unbounded min value", + .errstr = "R0 invalid mem access 'inv'", .result = REJECT, }, { @@ -8634,6 +8634,127 @@ static struct bpf_test tests[] = { .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS, }, { + "check deducing bounds from const, 1", + .insns = { + BPF_MOV64_IMM(BPF_REG_0, 1), + BPF_JMP_IMM(BPF_JSGE, BPF_REG_0, 1, 0), + BPF_ALU64_REG(BPF_SUB, BPF_REG_0, BPF_REG_1), + BPF_EXIT_INSN(), + }, + .result = REJECT, + .errstr = "R0 tried to subtract pointer from scalar", + }, + { + "check deducing bounds from const, 2", + .insns = { + BPF_MOV64_IMM(BPF_REG
Re: [RFC PATCH net-next 02/12] selftests: forwarding: Add a test for FDB learning
On 1/17/18 3:31 PM, Jiri Pirko wrote: > Wed, Jan 17, 2018 at 11:59:10PM CET, ro...@cumulusnetworks.com wrote: >> On Wed, Jan 17, 2018 at 2:46 PM, Roopa Prabhu >> wrote: >>> On Wed, Jan 17, 2018 at 1:01 PM, Jiri Pirko wrote: Wed, Jan 17, 2018 at 09:48:54PM CET, dsah...@gmail.com wrote: > On 1/15/18 11:18 AM, Ido Schimmel wrote: >> diff --git a/tools/testing/selftests/forwarding/lib.sh >> b/tools/testing/selftests/forwarding/lib.sh >> index bb423371f4de..264bf0af4c4d 100644 >> --- a/tools/testing/selftests/forwarding/lib.sh >> +++ b/tools/testing/selftests/forwarding/lib.sh >> @@ -22,6 +22,11 @@ if [[ ! -x "$(command -v jq)" ]]; then >> exit 0 >> fi >> >> +if [[ ! -x "$(command -v mausezahn)" ]]; then >> +echo "SKIP: mausezahn not installed" >> +exit 0 >> +fi >> + > > The checks are good, but hitting a collision with this one: > $ ./router.sh > SKIP: mausezahn not installed > > For debian, it is mz not mausezahn. That is weird. If you look at the sources, the binary name is "mausezahn". Looks like debian is doing some renaming :/ >>> >>> I have seen/used both versions. Debian packages from some old upstream >>> source which calls it mz. >>> https://packages.debian.org/sid/main/mz >>> >>> I have used latest mz (with ipv6 support etc) from netsniff-ng which >>> builds it as mausezahn. >> >> >> actually Debian also builds >> https://packages.debian.org/sid/netsniff-ng and installs 'mausezahn' >> >> so you might have to check and use which exists... (mausezahn overriding mz). > > I believe we can stick with mausezahn, since it is apparently > available on Debian too. > How about the attached - make ping, ping6 and mz variables? Default to ping and mausezahn but let users update that name if necessary. Comments in the sample config to the effect need to be added to help users. 
diff --git a/tools/testing/selftests/forwarding/bridge.sh b/tools/testing/selftests/forwarding/bridge.sh index 7ae4f1639c45..6a34489184ea 100755 --- a/tools/testing/selftests/forwarding/bridge.sh +++ b/tools/testing/selftests/forwarding/bridge.sh @@ -101,13 +101,13 @@ trap cleanup EXIT setup_prepare setup_wait -ping_test "vrf-h1" 192.0.2.2 -ping_test "vrf-h1" 2001:db8:1::2 +ping_test ${PING} "vrf-h1" 192.0.2.2 +ping_test ${PING6} "vrf-h1" 2001:db8:1::2 old_mtu=$(mtu_get $h1) mtu_change 9000 "${netifs_arr[@]}" -ping_test "vrf-h1" 192.0.2.2 -ping_test "vrf-h1" 2001:db8:1::2 +ping_test ${PING} "vrf-h1" 192.0.2.2 +ping_test ${PING6} "vrf-h1" 2001:db8:1::2 mtu_change $old_mtu "${netifs_arr[@]}" learning_test "br0" $swp1 1 $h1 diff --git a/tools/testing/selftests/forwarding/forwarding.config.sample b/tools/testing/selftests/forwarding/forwarding.config.sample index f2b14814e4ba..78262e3b7523 100644 --- a/tools/testing/selftests/forwarding/forwarding.config.sample +++ b/tools/testing/selftests/forwarding/forwarding.config.sample @@ -17,3 +17,7 @@ declare -A OPTIONS # Time to wait after interfaces participating in the test are all UP. OPTIONS[wait_time]=5 + +#PING=ping +#PING6=ping6 +#MZ=mausezahn diff --git a/tools/testing/selftests/forwarding/lib.sh b/tools/testing/selftests/forwarding/lib.sh index b9b049af93cb..774f5c68ad0b 100644 --- a/tools/testing/selftests/forwarding/lib.sh +++ b/tools/testing/selftests/forwarding/lib.sh @@ -22,8 +22,8 @@ if [[ ! -x "$(command -v jq)" ]]; then exit 0 fi -if [[ ! -x "$(command -v mausezahn)" ]]; then - echo "SKIP: mausezahn not installed" +if [[ ! -x "$(command -v ${MZ})" ]]; then + echo "SKIP: ${MZ} not installed" exit 0 fi @@ -242,12 +242,13 @@ tc_offload_check() ping_test() { - local vrf_name=$1 - local dip=$2 + local ping=$1 + local vrf_name=$2 + local dip=$3 RET=0 - ip vrf exec $vrf_name ping $dip -c 10 -i 0.1 -w 2 &> /dev/null + ip vrf exec $vrf_name $ping $dip -c 10 -i 0.1 -w 2 &> /dev/null check_err $? 
print_result "ping" } @@ -266,7 +267,7 @@ learning_test() | jq -e '.[] | select(.mac == "de:ad:be:ef:13:37")' &> /dev/null check_fail $? "found FDB record when should not" - mausezahn $host_if -c 1 -p 64 -a de:ad:be:ef:13:37 -t ip -q + ${MZ} $host_if -c 1 -p 64 -a de:ad:be:ef:13:37 -t ip -q bridge -j fdb show br $bridge brport $br_port vlan $vid \ | jq -e '.[] | select(.mac == "de:ad:be:ef:13:37")' &> /dev/null @@ -283,7 +284,7 @@ learning_test() bridge link set dev $br_port learning off - mausezahn $host_if -c 1 -p 64 -a de:ad:be:ef:13:37 -t ip -q + ${MZ} $host_if -c 1 -p 64 -a de:ad:be:ef:13:37 -t ip -q bridge -j fdb show br $bridge brport $br_port vlan $vid \ | jq -e '.[] | select(.mac == "de:ad:be:ef:13:37")' &> /dev/null @@ -309,7 +310,7 @@ flood_test_do() tc filter add dev $host2_if ingress protocol ip pref 1 handle 101 \
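One detail the attachment leaves open is what happens when no forwarding.config is present at all: the scripts would then expand empty ${PING}/${PING6}/${MZ}. A hedged sketch of default assignments that could sit near the top of lib.sh (the := idiom keeps any value a sourced config file already set):

```shell
# Fall back to the standard tool names unless forwarding.config set them.
: "${PING:=ping}"
: "${PING6:=ping6}"
: "${MZ:=mausezahn}"
```

With this in place the sample config's commented-out PING/PING6/MZ lines stay purely optional overrides.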
Re: remove pci_dma_* abuses and workarounds V2
[+cc David] On Wed, Jan 10, 2018 at 07:03:18PM +0100, Christoph Hellwig wrote: > Back before the dawn of time pci_dma_* with a NULL pci_dev argument > was used for all kinds of things, e.g. dma mapping for non-PCI > devices. All this has been long removed, but it turns out we > still care for a NULL pci_dev in the wrappers, and we still have > two odd USB drivers that use pci_dma_alloc_consistent for allocating > memory while ignoring the dma_addr_t entirely, and a network driver > mixing the already wrong usage of dma_* with a NULL device with a > single call to pci_free_consistent. > > This series switches the two usb drivers to use plain kzalloc, the > net driver to properly use the dma API and then removes the handling > of the NULL pci_dev in the pci_dma_* wrappers. > > Changes since V1: > - remove allocation failure printks > - use kcalloc > - fix tsi108_eth > - improve changelogs Applied to pci/dma for v4.16, thanks!
Re: [PATCH 3/4] tsi108_eth: use dma API properly
[+cc David, FYI, I plan to merge this via PCI along with the rest of Christoph's series] On Wed, Jan 10, 2018 at 07:03:21PM +0100, Christoph Hellwig wrote: > We need to pass a struct device to the dma API, even if some > architectures still support that for legacy reasons, and should not mix > it with the old PCI dma API. > > Note that the driver also seems to never actually unmap its dma mappings, > but to fix that we'll need someone more familiar with the driver. > > Signed-off-by: Christoph Hellwig > --- > drivers/net/ethernet/tundra/tsi108_eth.c | 36 > ++-- > 1 file changed, 20 insertions(+), 16 deletions(-) > > diff --git a/drivers/net/ethernet/tundra/tsi108_eth.c > b/drivers/net/ethernet/tundra/tsi108_eth.c > index 0624b71ab5d4..edcd1e60b30d 100644 > --- a/drivers/net/ethernet/tundra/tsi108_eth.c > +++ b/drivers/net/ethernet/tundra/tsi108_eth.c > @@ -152,6 +152,8 @@ struct tsi108_prv_data { > u32 msg_enable; /* debug message level */ > struct mii_if_info mii_if; > unsigned int init_media; > + > + struct platform_device *pdev; > }; > > /* Structure for a device driver */ > @@ -703,17 +705,18 @@ static int tsi108_send_packet(struct sk_buff * skb, > struct net_device *dev) > data->txskbs[tx] = skb; > > if (i == 0) { > - data->txring[tx].buf0 = dma_map_single(NULL, skb->data, > - skb_headlen(skb), DMA_TO_DEVICE); > + data->txring[tx].buf0 = dma_map_single(&data->pdev->dev, > + skb->data, skb_headlen(skb), > + DMA_TO_DEVICE); > data->txring[tx].len = skb_headlen(skb); > misc |= TSI108_TX_SOF; > } else { > const skb_frag_t *frag = &skb_shinfo(skb)->frags[i - 1]; > > - data->txring[tx].buf0 = skb_frag_dma_map(NULL, frag, > - 0, > - > skb_frag_size(frag), > - DMA_TO_DEVICE); > + data->txring[tx].buf0 = > + skb_frag_dma_map(&data->pdev->dev, frag, > + 0, skb_frag_size(frag), > + DMA_TO_DEVICE); > data->txring[tx].len = skb_frag_size(frag); > } > > @@ -808,9 +811,9 @@ static int tsi108_refill_rx(struct net_device *dev, int > budget) > if (!skb) > break; > > - 
data->rxring[rx].buf0 = dma_map_single(NULL, skb->data, > - TSI108_RX_SKB_SIZE, > - DMA_FROM_DEVICE); > + data->rxring[rx].buf0 = dma_map_single(&data->pdev->dev, > + skb->data, TSI108_RX_SKB_SIZE, > + DMA_FROM_DEVICE); > > /* Sometimes the hardware sets blen to zero after packet >* reception, even though the manual says that it's only ever > @@ -1308,15 +1311,15 @@ static int tsi108_open(struct net_device *dev) > data->id, dev->irq, dev->name); > } > > - data->rxring = dma_zalloc_coherent(NULL, rxring_size, &data->rxdma, > -GFP_KERNEL); > + data->rxring = dma_zalloc_coherent(&data->pdev->dev, rxring_size, > + &data->rxdma, GFP_KERNEL); > if (!data->rxring) > return -ENOMEM; > > - data->txring = dma_zalloc_coherent(NULL, txring_size, &data->txdma, > -GFP_KERNEL); > + data->txring = dma_zalloc_coherent(&data->pdev->dev, txring_size, > + &data->txdma, GFP_KERNEL); > if (!data->txring) { > - pci_free_consistent(NULL, rxring_size, data->rxring, > + dma_free_coherent(&data->pdev->dev, rxring_size, data->rxring, > data->rxdma); > return -ENOMEM; > } > @@ -1428,10 +1431,10 @@ static int tsi108_close(struct net_device *dev) > dev_kfree_skb(skb); > } > > - dma_free_coherent(0, > + dma_free_coherent(&data->pdev->dev, > TSI108_RXRING_LEN * sizeof(rx_desc), > data->rxring, data->rxdma); > - dma_free_coherent(0, > + dma_free_coherent(&data->pdev->dev, > TSI108_TXRING_LEN * sizeof(tx_desc), > data->txring, data->txdma); > > @@ -1576,6 +1579,7 @@ tsi108_init_one(struct platform_device *pdev) > printk("tsi108_eth%d: probe...\n", pdev->id); > data = netdev_priv(dev); > data->dev = dev; > +
Re: [PATCH net-next] net: core: Expose number of link up/down transitions
> What is the idea to have two separate counters? Can a delta between them > be bigger than 1? Yes, it can. These counters are incremented in netif_carrier_on() / netif_carrier_off(). They are not always called in pairs and they can be called multiple times for the same event. The phylib will call them when it notices the PHY saying the link is down/up, and the MAC driver sometimes also calls them. Andrew
Re: [PATCH RFC net-next 4/6] sock: permit SO_ZEROCOPY on PF_RDS socket
On Wed, Jan 17, 2018 at 7:20 AM, Sowmini Varadhan wrote: > allow the application to set SO_ZEROCOPY on the underlying sk > of a PF_RDS socket > > Signed-off-by: Sowmini Varadhan > --- > net/core/sock.c |7 +++ > 1 files changed, 7 insertions(+), 0 deletions(-) > > diff --git a/net/core/sock.c b/net/core/sock.c > index 4f52677..f0f44b0 100644 > --- a/net/core/sock.c > +++ b/net/core/sock.c > @@ -1049,6 +1049,13 @@ int sock_setsockopt(struct socket *sock, int level, > int optname, > break; > > case SO_ZEROCOPY: > + if (sk->sk_family == PF_RDS) { > + if (val < 0 || val > 1) > + ret = -EINVAL; > + else > + sock_valbool_flag(sk, SOCK_ZEROCOPY, valbool); > + break; > + } Let's integrate this into the existing logic. Perhaps something like if (sk->sk_family == PF_INET || sk->sk_family == PF_INET6) { if (sk->sk_protocol != IPPROTO_TCP) ret = -ENOTSUPP; else if (sk->sk_state != TCP_CLOSE) ret = -EBUSY; } else if (sk->sk_family != PF_RDS) { ret = -ENOTSUPP; } if (!ret) { if (val < 0 || val > 1) ret = -EINVAL; else sock_valbool_flag(sk, SOCK_ZEROCOPY, valbool); }
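The precedence in that combined check is easy to verify with a userspace model (the enum values and EOPNOTSUPP below are illustrative stand-ins; the kernel snippet above uses its internal ENOTSUPP and real socket state):

```c
#include <errno.h>
#include <stdbool.h>

enum fam { FAM_INET, FAM_INET6, FAM_RDS, FAM_OTHER };

/* Model of the restructured SO_ZEROCOPY validation: family/protocol
 * checks come first, the value range check only if still permitted. */
static int so_zerocopy_check(enum fam family, bool is_tcp,
			     bool tcp_closed, int val)
{
	int ret = 0;

	if (family == FAM_INET || family == FAM_INET6) {
		if (!is_tcp)
			ret = -EOPNOTSUPP;
		else if (!tcp_closed)
			ret = -EBUSY;
	} else if (family != FAM_RDS) {
		ret = -EOPNOTSUPP;
	}
	if (!ret && (val < 0 || val > 1))
		ret = -EINVAL;
	return ret;
}
```

The point of the restructure is that PF_RDS flows through the same tail as a closed TCP socket, so the EINVAL range check is written exactly once.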
Re: [PATCH net-next] net: core: Expose number of link up/down transitions
Thu, Jan 18, 2018 at 12:06:57AM CET, f.faine...@gmail.com wrote: >From: David Decotigny > >Expose the number of times the link has been going UP or DOWN, and >update the "carrier_changes" counter to be the sum of these two events. >While at it, also update the sysfs-class-net documentation to cover: >carrier_changes (3.15), count_link_up (4.16) and count_link_down (4.16) > >Signed-off-by: David Decotigny [...] >@@ -161,6 +161,8 @@ enum { > IFLA_EVENT, > IFLA_NEW_NETNSID, > IFLA_IF_NETNSID, >+ IFLA_COUNT_LINK_UP, >+ IFLA_COUNT_LINK_DOWN, IFLA_LINK_UP_COUNT, IFLA_LINK_DOWN_COUNT, would sound a bit nicer to me. > __IFLA_MAX > };
Re: [PATCH RFC net-next 1/6] sock: MSG_PEEK support for sk_error_queue
On Wed, Jan 17, 2018 at 7:19 AM, Sowmini Varadhan wrote: > Allow the application the ability to use MSG_PEEK with sk_error_queue > so that it can peek and re-read message in cases where MSG_TRUNC > may be encountered. > > Signed-off-by: Sowmini Varadhan > int sock_recv_errqueue(struct sock *sk, struct msghdr *msg, int len, > - int level, int type) > + int level, int type, int flags) > { > struct sock_exterr_skb *serr; > struct sk_buff *skb; > @@ -2916,7 +2916,10 @@ int sock_recv_errqueue(struct sock *sk, struct msghdr > *msg, int len, > err = copied; > > out_free_skb: > - kfree_skb(skb); > + if (likely(!(flags & MSG_PEEK))) > + kfree_skb(skb); > + else > + skb_queue_head(&sk->sk_error_queue, skb); This can cause reordering with parallel readers. Can we avoid the need for peeking? It also caused a slew of subtle bugs previously. How about just define a max number of cookies and require the caller to always read with sufficient room to hold them?
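The reordering hazard can be reproduced deterministically with a toy queue standing in for sk_error_queue (plain ints as "messages"; this is a sketch of the race, not kernel code): reader A dequeues a message on the MSG_PEEK path, reader B consumes the next one in the meantime, and A's skb_queue_head()-style requeue then yields out-of-order delivery.

```c
/* Toy FIFO standing in for sk_error_queue. */
#define QCAP 8
static int q[QCAP];
static int qlen;

static void queue_tail(int v)
{
	q[qlen++] = v;
}

static int dequeue_head(void)
{
	int i, v = q[0];

	for (i = 1; i < qlen; i++)
		q[i - 1] = q[i];
	qlen--;
	return v;
}

/* The MSG_PEEK path quoted above: put the message back at the head. */
static void requeue_head(int v)
{
	int i;

	for (i = qlen; i > 0; i--)
		q[i] = q[i - 1];
	q[0] = v;
	qlen++;
}
```

With messages 1, 2, 3 queued, the interleaving "A dequeues 1; B dequeues and delivers 2; A requeues 1" makes the observed delivery order 2, 1, 3 even though nothing was lost — which is the subtle-bug class Willem is referring to.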
Re: [PATCH net-next] net: core: Expose number of link up/down transitions
On Wed, Jan 17, 2018 at 03:06:57PM -0800, Florian Fainelli wrote: > From: David Decotigny > > Expose the number of times the link has been going UP or DOWN, and > update the "carrier_changes" counter to be the sum of these two events. > While at it, also update the sysfs-class-net documentation to cover: > carrier_changes (3.15), count_link_up (4.16) and count_link_down (4.16) What is the idea to have two separate counters? Can a delta between them be a bigger than 1? > > Signed-off-by: David Decotigny > [Florian: > * rebase > * add documentation > * merge carrier_changes with up/down counters] > Signed-off-by: Florian Fainelli > --- > Documentation/ABI/testing/sysfs-class-net | 24 > include/linux/netdevice.h | 6 -- > include/uapi/linux/if_link.h | 2 ++ > net/core/net-sysfs.c | 23 ++- > net/core/rtnetlink.c | 13 +++-- > net/sched/sch_generic.c | 4 ++-- > 6 files changed, 65 insertions(+), 7 deletions(-) > > diff --git a/Documentation/ABI/testing/sysfs-class-net > b/Documentation/ABI/testing/sysfs-class-net > index 6856da99b6f7..e4b0d5157305 100644 > --- a/Documentation/ABI/testing/sysfs-class-net > +++ b/Documentation/ABI/testing/sysfs-class-net > @@ -259,3 +259,27 @@ Contact: netdev@vger.kernel.org > Description: > Symbolic link to the PHY device this network device is attached > to. 
> + > +What:/sys/class/net/<iface>/carrier_changes > +Date:Mar 2014 > +KernelVersion: 3.15 > +Contact: netdev@vger.kernel.org > +Description: > + 32-bit unsigned integer counting the number of times the link > has > + seen a change from UP to DOWN and vice versa > + > +What:/sys/class/net/<iface>/count_link_up > +Date:Jan 2018 > +KernelVersion: 4.16 > +Contact: netdev@vger.kernel.org > +Description: > + 32-bit unsigned integer counting the number of times the link > has > + been up > + > +What:/sys/class/net/<iface>/count_link_down > +Date:Jan 2018 > +KernelVersion: 4.16 > +Contact: netdev@vger.kernel.org > +Description: > + 32-bit unsigned integer counting the number of times the link > has > + been down > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h > index ed0799a12bf2..28f68f7513d0 100644 > --- a/include/linux/netdevice.h > +++ b/include/linux/netdevice.h > @@ -1680,8 +1680,6 @@ struct net_device { > unsigned long base_addr; > int irq; > > - atomic_tcarrier_changes; > - > /* >* Some hardware also needs these fields (state,dev_list, >* napi_list,unreg_list,close_list) but they are not > @@ -1719,6 +1717,10 @@ struct net_device { > atomic_long_t tx_dropped; > atomic_long_t rx_nohandler; > > + /* Stats to monitor link on/off, flapping */ > + atomic_tcount_link_up; > + atomic_tcount_link_down; > + > #ifdef CONFIG_WIRELESS_EXT > const struct iw_handler_def *wireless_handlers; > struct iw_public_data *wireless_data; > diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h > index f8f04fed6186..6e44b0674ba4 100644 > --- a/include/uapi/linux/if_link.h > +++ b/include/uapi/linux/if_link.h > @@ -161,6 +161,8 @@ enum { > IFLA_EVENT, > IFLA_NEW_NETNSID, > IFLA_IF_NETNSID, > + IFLA_COUNT_LINK_UP, > + IFLA_COUNT_LINK_DOWN, > __IFLA_MAX > }; > > diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c > index 7bf8b85ade16..9f732c3dc2ce 100644 > --- a/net/core/net-sysfs.c > +++ b/net/core/net-sysfs.c > @@ -295,10 +295,29 @@ static ssize_t carrier_changes_show(struct device *dev, > struct net_device *netdev = 
to_net_dev(dev); > > return sprintf(buf, fmt_dec, > -atomic_read(&netdev->carrier_changes)); > +atomic_read(&netdev->count_link_up) + > +atomic_read(&netdev->count_link_down)); > } > static DEVICE_ATTR_RO(carrier_changes); > > +static ssize_t count_link_up_show(struct device *dev, > + struct device_attribute *attr, char *buf) > +{ > + struct net_device *netdev = to_net_dev(dev); > + > + return sprintf(buf, fmt_dec, atomic_read(&netdev->count_link_up)); > +} > +static DEVICE_ATTR_RO(count_link_up); > + > +static ssize_t count_link_down_show(struct device *dev, > + struct device_attribute *attr, char *buf) > +{ > + struct net_device *netdev = to_net_dev(dev); > + > + return sprintf(buf, fmt_dec, atomic_read(&netdev->count_link_down)); > +} > +static DEVICE_ATTR_RO(count_link_down); > + > /* read-write attributes */ > > static int change_mtu(struct net_device *dev, unsigned long new_mtu) > @@ -547,6 +566,
Re: [patch net-next v11 00/13] net: sched: allow qdiscs to share filter block instances
Wed, Jan 17, 2018 at 10:33:32PM CET, da...@davemloft.net wrote: >From: Jiri Pirko >Date: Wed, 17 Jan 2018 21:18:07 +0100 > >> Odd. I don't see this warning with: >> gcc version 6.4.1 20170727 (Red Hat 6.4.1-1) (GCC) > >Yeah, for reference I'm using: > >gcc (GCC) 7.2.1 20170915 (Red Hat 7.2.1-2) Interesting, I upgraded to the same version: gcc version 7.2.1 20170915 (Red Hat 7.2.1-2) (GCC) I still don't see the warning :O what is your cmdline?
Re: [RFC PATCH net-next 02/12] selftests: forwarding: Add a test for FDB learning
Wed, Jan 17, 2018 at 11:59:10PM CET, ro...@cumulusnetworks.com wrote: >On Wed, Jan 17, 2018 at 2:46 PM, Roopa Prabhu >wrote: >> On Wed, Jan 17, 2018 at 1:01 PM, Jiri Pirko wrote: >>> Wed, Jan 17, 2018 at 09:48:54PM CET, dsah...@gmail.com wrote: On 1/15/18 11:18 AM, Ido Schimmel wrote: > diff --git a/tools/testing/selftests/forwarding/lib.sh > b/tools/testing/selftests/forwarding/lib.sh > index bb423371f4de..264bf0af4c4d 100644 > --- a/tools/testing/selftests/forwarding/lib.sh > +++ b/tools/testing/selftests/forwarding/lib.sh > @@ -22,6 +22,11 @@ if [[ ! -x "$(command -v jq)" ]]; then > exit 0 > fi > > +if [[ ! -x "$(command -v mausezahn)" ]]; then > +echo "SKIP: mausezahn not installed" > +exit 0 > +fi > + The checks are good, but hitting a collision with this one: $ ./router.sh SKIP: mausezahn not installed For debian, it is mz not mausezahn. >>> >>> That is weird. If you look at the sources, the binary name is >>> "mausezahn". Looks like debian is doing some renaming :/ >>> >> >> I have seen/used both versions. Debian packages from some old upstream >> source which calls it mz. >> https://packages.debian.org/sid/main/mz >> >> I have used latest mz (with ipv6 support etc) from netsniff-ng which >> builds it as mausezahn. > > >actually Debian also builds >https://packages.debian.org/sid/netsniff-ng and installs 'mausezahn' > >so you might have to check and use which exists... (mausezahn overriding mz). I believe we can stick with mausezahn, since it is apparently available on Debian too.
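Roopa's "check and use which exists" suggestion could look like the following helper (a hedged sketch; the function name is made up), preferring the netsniff-ng binary name and falling back to Debian's historical mz:

```shell
# Print the first available mausezahn-compatible binary name, if any.
pick_mz()
{
	if command -v mausezahn > /dev/null 2>&1; then
		echo mausezahn
	elif command -v mz > /dev/null 2>&1; then
		echo mz
	fi
}
```

lib.sh could then set MZ=$(pick_mz) and keep its existing SKIP message when the result is empty.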
Re: [PATCH net] net: validate untrusted gso packets
> From: Willem de Bruijn
>
> Validate gso packet type and headers on kernel entry. Reuse the info
> gathered by skb_probe_transport_header.
>
> Syzbot found two bugs by passing bad gso packets in packet sockets.
> Untrusted user packets are limited to a small set of gso types in
> virtio_net_hdr_to_skb. But segmentation occurs on packet contents.
> Syzkaller was able to enter gso callbacks that are not hardened
> against untrusted user input.

Does this mean there's something missing in the existing header check for dodgy packets?

>>> virtio_net_hdr_to_skb checks gso_type, but it does not verify that this
>>> type correctly describes the actual packet. Segmentation happens based
>>> on packet contents. So a packet was crafted to enter sctp gso, even
>>> though no such gso_type exists. This issue is not specific to sctp.
>>
>> So it looks to me we should do it here in sctp_gso_segment().
>>
>> if (skb_gso_ok(skb, features | NETIF_F_GSO_ROBUST)) {
>> /* Packet is from an untrusted source, reset gso_segs. */
>
> No dodgy source can legitimately generate sctp gso packets, so it should not
> even get there. Also, a packet can just as easily spoof an esp packet.
> See also the discussion in the Link above.
>
> We can address this specific issue in segmentation by placing a check
> analogous to skb_validate_dodgy_gso in the network layer. Something
> like this (but with ECN option support):
>
> @@ -1258,6 +1258,22 @@ struct sk_buff *inet_gso_segment(struct sk_buff *skb,
>
>  skb_reset_transport_header(skb);
>
> + gso_type = skb_shinfo(skb)->gso_type;
> + if (gso_type & SKB_GSO_DODGY) {
> +	switch (gso_type & (SKB_GSO_TCPV4 | SKB_GSO_UDP)) {
> +	case SKB_GSO_TCPV4:
> +		if (proto != IPPROTO_TCP)
> +			goto out;
> +		break;
> +	case SKB_GSO_UDP:
> +		if (proto != IPPROTO_UDP)
> +			goto out;
> +		break;
> +	default:
> +		goto out;
> +	}
> + }

Okay, I sent this instead:
http://patchwork.ozlabs.org/patch/862643/
Re: [PATCH net-next] net: core: Expose number of link up/down transitions
On Wed, 17 Jan 2018 15:06:57 -0800, Florian Fainelli wrote: > +What:/sys/class/net// ?
Re: [PATCH net-next] net: stmmac: Fix reception of Broadcom switches tags
Hi Peppe,

On 01/16/2018 11:06 PM, Giuseppe CAVALLARO wrote:
> Hi Florian
>
> for gmac4.x and gmac3.x series the ACS bit is the Automatic Pad or CRC
> Stripping, so the core strips the Pad or FCS on frames if the value of
> the length field is < 1536 bytes. For MAC10-100 there is the Bit 8 (ASTP)
> of the reg0 that does the same if len is < 46 bytes.

OK, thanks for the detail on the older core, I will update my v2 to take that into account.

> In your patch I can just suggest to add a new field to strip the PAD/FCS
> w/o passing the whole netdev struct to the core_init. In the main driver,
> we could manage the pad-strip feature (also by using dt) or disable it in
> case of netdev_uses_dsa; then propagating this setting to the core_init
> or calling a new callback. What do you think?

I don't think adding a DT property is a good idea, because we can determine at runtime whether ACS is necessary, based on whether DSA with proprietary Ethernet switch tags is in use (netdev_uses_dsa()). Having a DT property configure whether ACS is turned on or off means that the Device Tree now describes a policy, rather than a capability, which is counter to the Device Tree spirit. Besides, this leaves room for making mistakes, and it creates a support burden.

When interfaced with a switch, the switch is guaranteed to egress 64-byte (including FCS) packets towards stmmac, and the only thing we need to do is ensure that stmmac does the same thing. This is a small penalty to pay for small packets, and it is done only when DSA is used.

Regarding passing a net_device to the core_init functions, this was done because dev->mtu was already accessed, and we might need to access additional net_device members in the future, which means we need a solution that scales, so passing the net_device around was a natural choice. The function pointers described in common.h are not exactly consistent in what type of arguments they take, so I did not see this as a big problem.
Thank you. > > Regards > Peppe > > On 1/17/2018 12:25 AM, Florian Fainelli wrote: >> Broadcom tags inserted by Broadcom switches put a 4 byte header after >> the MAC SA and before the EtherType, which may look like some sort of 0 >> length LLC/SNAP packet (tcpdump and wireshark do think that way). With >> ACS enabled in stmmac the packets were truncated to 8 bytes on >> reception, whereas clearing this bit allowed normal reception to occur. >> >> In order to make that possible, we need to pass a net_device argument to >> the different core_init() functions and we are dependent on the Broadcom >> tagger padding packets correctly (which it now does). To be as little >> invasive as possible, this is only done for gmac1000 when the network >> device is DSA-enabled (netdev_uses_dsa() returns true). >> >> Signed-off-by: Florian Fainelli >> --- >> drivers/net/ethernet/stmicro/stmmac/common.h | 2 +- >> drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c | 3 ++- >> drivers/net/ethernet/stmicro/stmmac/dwmac1000_core.c | 12 +++- >> drivers/net/ethernet/stmicro/stmmac/dwmac100_core.c | 3 ++- >> drivers/net/ethernet/stmicro/stmmac/dwmac4_core.c | 11 ++- >> drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 2 +- >> 6 files changed, 27 insertions(+), 6 deletions(-) >> >> diff --git a/drivers/net/ethernet/stmicro/stmmac/common.h >> b/drivers/net/ethernet/stmicro/stmmac/common.h >> index ce2ea2d491ac..2ffe76c0ff74 100644 >> --- a/drivers/net/ethernet/stmicro/stmmac/common.h >> +++ b/drivers/net/ethernet/stmicro/stmmac/common.h >> @@ -474,7 +474,7 @@ struct mac_device_info; >> /* Helpers to program the MAC core */ >> struct stmmac_ops { >> /* MAC core initialization */ >> - void (*core_init)(struct mac_device_info *hw, int mtu); >> + void (*core_init)(struct mac_device_info *hw, struct net_device >> *dev); >> /* Enable the MAC RX/TX */ >> void (*set_mac)(void __iomem *ioaddr, bool enable); >> /* Enable and verify that the IPC module is supported */ >> diff --git 
a/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c >> b/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c >> index 9eb7f65d8000..a3fa65b1ca8e 100644 >> --- a/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c >> +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c >> @@ -483,7 +483,8 @@ static int sun8i_dwmac_init(struct platform_device >> *pdev, void *priv) >> return 0; >> } >> -static void sun8i_dwmac_core_init(struct mac_device_info *hw, int mtu) >> +static void sun8i_dwmac_core_init(struct mac_device_info *hw, >> + struct net_device *dev) >> { >> void __iomem *ioaddr = hw->pcsr; >> u32 v; >> diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac1000_core.c >> b/drivers/net/ethernet/stmicro/stmmac/dwmac1000_core.c >> index 8a86340ff2d3..540d21786a43 100644 >> --- a/drivers/net/ethernet/stmicro/stmmac/dwmac1000_core.c >> +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac10
[RFC iproute2] tc: Add support for the TBS Qdisc
From: Vinicius Costa Gomes The Time Based Scheduler (TBS) queueing discipline allows precise control of the transmission time of packets. The syntax is: tc qdisc add dev DEV parent NODE tbs delta clockid offload <1|0> Signed-off-by: Vinicius Costa Gomes --- include/uapi/linux/pkt_sched.h | 17 ++ tc/Makefile| 1 + tc/q_tbs.c | 119 + 3 files changed, 137 insertions(+) create mode 100644 tc/q_tbs.c diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h index af3cc2f4..514cb742 100644 --- a/include/uapi/linux/pkt_sched.h +++ b/include/uapi/linux/pkt_sched.h @@ -935,4 +935,21 @@ enum { #define TCA_CBS_MAX (__TCA_CBS_MAX - 1) + +/* TBS */ +struct tc_tbs_qopt { + __u8 offload; + __u8 _pad[3]; + __s32 delta; + __s32 clockid; +}; + +enum { + TCA_TBS_UNSPEC, + TCA_TBS_PARMS, + __TCA_TBS_MAX, +}; + +#define TCA_TBS_MAX (__TCA_TBS_MAX - 1) + #endif diff --git a/tc/Makefile b/tc/Makefile index 3716dd6a..3c87b0dc 100644 --- a/tc/Makefile +++ b/tc/Makefile @@ -71,6 +71,7 @@ TCMODULES += q_clsact.o TCMODULES += e_bpf.o TCMODULES += f_matchall.o TCMODULES += q_cbs.o +TCMODULES += q_tbs.o TCSO := ifeq ($(TC_CONFIG_ATM),y) diff --git a/tc/q_tbs.c b/tc/q_tbs.c new file mode 100644 index ..f4a73b58 --- /dev/null +++ b/tc/q_tbs.c @@ -0,0 +1,119 @@ +/* + * q_tbs.c TBS. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version + * 2 of the License, or (at your option) any later version. + * + * Authors:Vinicius Costa Gomes + * + */ + +#include +#include +#include +#include +#include +#include +#include +#include + +#include "utils.h" +#include "tc_util.h" + +static void explain(void) +{ + fprintf(stderr, "Usage: ... 
tbs delta NANOS clockid CLOCKID [offload 0|1]\n"); + +} + +static void explain1(const char *arg, const char *val) +{ + fprintf(stderr, "tbs: illegal value for \"%s\": \"%s\"\n", arg, val); +} + +static int tbs_parse_opt(struct qdisc_util *qu, int argc, +char **argv, struct nlmsghdr *n, const char *dev) +{ + struct tc_tbs_qopt opt = {}; + struct rtattr *tail; + + while (argc > 0) { + if (matches(*argv, "offload") == 0) { + NEXT_ARG(); + if (opt.offload) { + fprintf(stderr, "tbs: duplicate \"offload\" specification\n"); + return -1; + } + if (get_u8(&opt.offload, *argv, 0)) { + explain1("offload", *argv); + return -1; + } + } else if (matches(*argv, "delta") == 0) { + NEXT_ARG(); + if (opt.delta) { + fprintf(stderr, "tbs: duplicate \"delta\" specification\n"); + return -1; + } + if (get_s32(&opt.delta, *argv, 0)) { + explain1("delta", *argv); + return -1; + } + } else if (matches(*argv, "clockid") == 0) { + NEXT_ARG(); + if (opt.clockid) { + fprintf(stderr, "tbs: duplicate \"clockid\" specification\n"); + return -1; + } + if (get_s32(&opt.clockid, *argv, 0)) { + explain1("clockid", *argv); + return -1; + } + } else if (strcmp(*argv, "help") == 0) { + explain(); + return -1; + } else { + fprintf(stderr, "tbs: unknown parameter \"%s\"\n", *argv); + explain(); + return -1; + } + argc--; argv++; + } + + tail = NLMSG_TAIL(n); + addattr_l(n, 1024, TCA_OPTIONS, NULL, 0); + addattr_l(n, 1024, TCA_TBS_PARMS, &opt, sizeof(opt)); + tail->rta_len = (void *) NLMSG_TAIL(n) - (void *) tail; + return 0; +} + +static int tbs_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt) +{ + struct rtattr *tb[TCA_TBS_MAX+1]; + struct tc_tbs_qopt *qopt; + + if (opt == NULL) + return 0; + + parse_rtattr_nested(tb, TCA_TBS_MAX, opt); + + if (tb[TCA_TBS_PARMS] == NULL) + return -1; + + qopt = RTA_DATA(tb[TCA_TBS_PARMS]); + if (RTA_PAYLOAD(tb[TCA_TBS_PARMS]) < sizeof(*
[PATCH net] gso: validate gso_type if SKB_GSO_DODGY
From: Willem de Bruijn Validate gso_type of untrusted SKB_GSO_DODGY packets during segmentation. Untrusted user packets are limited to a small set of gso types in virtio_net_hdr_to_skb. But segmentation occurs on packet contents. Syzkaller was able to enter gso callbacks that are not hardened against untrusted user input. Fixes: f43798c27684 ("tun: Allow GSO using virtio_net_hdr") Link: http://lkml.kernel.org/r/<001a1137452496ffc305617e5...@google.com> Reported-by: syzbot+fee64147a25aecd48...@syzkaller.appspotmail.com Signed-off-by: Willem de Bruijn --- net/ipv4/af_inet.c | 21 + net/ipv6/ip6_offload.c | 23 ++- 2 files changed, 43 insertions(+), 1 deletion(-) diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c index f00499a46927..d5a36827f7b1 100644 --- a/net/ipv4/af_inet.c +++ b/net/ipv4/af_inet.c @@ -1220,6 +1220,25 @@ int inet_sk_rebuild_header(struct sock *sk) } EXPORT_SYMBOL(inet_sk_rebuild_header); +static bool inet_gso_validate_dodgy(struct sk_buff *skb, int ipproto) +{ + unsigned int gso_type = skb_shinfo(skb)->gso_type; + + if (gso_type & SKB_GSO_DODGY) { + switch (gso_type & ~SKB_GSO_DODGY) { + case SKB_GSO_TCPV4: + case SKB_GSO_TCPV4 | SKB_GSO_TCP_ECN: + return ipproto == IPPROTO_TCP; + case SKB_GSO_UDP: + return ipproto == IPPROTO_UDP; + default: + return false; + } + } + + return true; +} + struct sk_buff *inet_gso_segment(struct sk_buff *skb, netdev_features_t features) { @@ -1245,6 +1264,8 @@ struct sk_buff *inet_gso_segment(struct sk_buff *skb, id = ntohs(iph->id); proto = iph->protocol; + if (!inet_gso_validate_dodgy(skb, proto)) + goto out; /* Warning: after this point, iph might be no longer valid */ if (unlikely(!pskb_may_pull(skb, ihl))) diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c index 4a87f9428ca5..662e16548104 100644 --- a/net/ipv6/ip6_offload.c +++ b/net/ipv6/ip6_offload.c @@ -55,6 +55,25 @@ static int ipv6_gso_pull_exthdrs(struct sk_buff *skb, int proto) return proto; } +static bool ipv6_gso_validate_dodgy(struct sk_buff 
*skb, int ipproto) +{ + unsigned int gso_type = skb_shinfo(skb)->gso_type; + + if (gso_type & SKB_GSO_DODGY) { + switch (gso_type & ~SKB_GSO_DODGY) { + case SKB_GSO_TCPV6: + case SKB_GSO_TCPV6 | SKB_GSO_TCP_ECN: + return ipproto == IPPROTO_TCP; + case SKB_GSO_UDP: + return ipproto == IPPROTO_UDP; + default: + return false; + } + } + + return true; +} + static struct sk_buff *ipv6_gso_segment(struct sk_buff *skb, netdev_features_t features) { @@ -82,10 +101,12 @@ static struct sk_buff *ipv6_gso_segment(struct sk_buff *skb, ipv6h = ipv6_hdr(skb); __skb_pull(skb, sizeof(*ipv6h)); - segs = ERR_PTR(-EPROTONOSUPPORT); proto = ipv6_gso_pull_exthdrs(skb, ipv6h->nexthdr); + if (!ipv6_gso_validate_dodgy(skb, proto)) + goto out; + segs = ERR_PTR(-EPROTONOSUPPORT); if (skb->encapsulation && skb_shinfo(skb)->gso_type & (SKB_GSO_IPXIP4 | SKB_GSO_IPXIP6)) udpfrag = proto == IPPROTO_UDP && encap; -- 2.16.0.rc1.238.g530d649a79-goog
Re: [RFC PATCH net-next 00/12] selftests: forwarding: Add VRF-based tests
> >> However, a similar kind of flexibility can be achieved by using VRFs and
> >> by looping the switch ports together. For example:
> >>
> >>                             br0
> >>                              +
> >>               vrf-h1        |        vrf-h2
> >>                 +     +-----+-----+     +
> >>                 |     |           |     |
> >>    192.0.2.1/24 +     +           +     + 192.0.2.2/24
> >>               swp1   swp2       swp3   swp4
> >>                 +     +           +     +
> >>                 |     |           |     |
> >>                 +-----+           +-----+
> >>
> Agreed this is really cool! For DSA-enabled switches, we usually have a
> host that does the test sequencing and then executes commands remotely on
> the DUT, but we might be able to get a similar framework up and
> running on the DUT itself without too much hassle.

I think the problem we will have is a lack of ports. Most DSA switches have 4 or 5 ports. Given the need for two ports per bridge port, we will be limited to bridges with just two members. That really limits what sort of tests you can do.

But for top-of-rack switches, with 16 ports and 8 loopback cables, that does give interesting setups. If I were writing tests for that class of routers, that would be the hardware setup I would define.

Andrew
[RFC v2 net-next 03/10] net: ipv4: udp: Hook into time based transmission.
From: Richard Cochran For udp packets, copy the desired future transmit time from the CMSG cookie into the skb. Signed-off-by: Richard Cochran Signed-off-by: Jesus Sanchez-Palencia --- net/ipv4/udp.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index 853321555a4e..89257422503a 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -926,6 +926,7 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) } ipc.sockc.tsflags = sk->sk_tsflags; + ipc.sockc.transmit_time = 0; ipc.addr = inet->inet_saddr; ipc.oif = sk->sk_bound_dev_if; @@ -1027,8 +1028,10 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) sizeof(struct udphdr), &ipc, &rt, msg->msg_flags); err = PTR_ERR(skb); - if (!IS_ERR_OR_NULL(skb)) + if (!IS_ERR_OR_NULL(skb)) { + skb->tstamp = ipc.sockc.transmit_time; err = udp_send_skb(skb, fl4); + } goto out; } -- 2.15.1
[RFC v2 net-next 05/10] net/sched: Allow creating a Qdisc watchdog with other clocks
From: Vinicius Costa Gomes This adds 'qdisc_watchdog_init_clockid()' that allows a clockid to be passed, this allows other time references to be used when scheduling the Qdisc to run. Signed-off-by: Vinicius Costa Gomes --- include/net/pkt_sched.h | 2 ++ net/sched/sch_api.c | 11 +-- 2 files changed, 11 insertions(+), 2 deletions(-) diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h index 815b92a23936..2466ea143d01 100644 --- a/include/net/pkt_sched.h +++ b/include/net/pkt_sched.h @@ -72,6 +72,8 @@ struct qdisc_watchdog { struct Qdisc*qdisc; }; +void qdisc_watchdog_init_clockid(struct qdisc_watchdog *wd, struct Qdisc *qdisc, +clockid_t clockid); void qdisc_watchdog_init(struct qdisc_watchdog *wd, struct Qdisc *qdisc); void qdisc_watchdog_schedule_ns(struct qdisc_watchdog *wd, u64 expires); diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c index d512f49ee83c..d6136920b5c1 100644 --- a/net/sched/sch_api.c +++ b/net/sched/sch_api.c @@ -596,12 +596,19 @@ static enum hrtimer_restart qdisc_watchdog(struct hrtimer *timer) return HRTIMER_NORESTART; } -void qdisc_watchdog_init(struct qdisc_watchdog *wd, struct Qdisc *qdisc) +void qdisc_watchdog_init_clockid(struct qdisc_watchdog *wd, struct Qdisc *qdisc, +clockid_t clockid) { - hrtimer_init(&wd->timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED); + hrtimer_init(&wd->timer, clockid, HRTIMER_MODE_ABS_PINNED); wd->timer.function = qdisc_watchdog; wd->qdisc = qdisc; } +EXPORT_SYMBOL(qdisc_watchdog_init_clockid); + +void qdisc_watchdog_init(struct qdisc_watchdog *wd, struct Qdisc *qdisc) +{ + qdisc_watchdog_init_clockid(wd, qdisc, CLOCK_MONOTONIC); +} EXPORT_SYMBOL(qdisc_watchdog_init); void qdisc_watchdog_schedule_ns(struct qdisc_watchdog *wd, u64 expires) -- 2.15.1
[RFC v2 net-next 02/10] net: ipv4: raw: Hook into time based transmission.
From: Richard Cochran For raw packets, copy the desired future transmit time from the CMSG cookie into the skb. Signed-off-by: Richard Cochran Signed-off-by: Jesus Sanchez-Palencia --- net/ipv4/raw.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c index 136544b36a46..e37ea8ab6a78 100644 --- a/net/ipv4/raw.c +++ b/net/ipv4/raw.c @@ -381,6 +381,7 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 *fl4, skb->priority = sk->sk_priority; skb->mark = sk->sk_mark; + skb->tstamp = sockc->transmit_time; skb_dst_set(skb, &rt->dst); *rtp = NULL; @@ -562,6 +563,7 @@ static int raw_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) } ipc.sockc.tsflags = sk->sk_tsflags; + ipc.sockc.transmit_time = 0; ipc.addr = inet->inet_saddr; ipc.opt = NULL; ipc.tx_flags = 0; -- 2.15.1
[PATCH net-next] net: core: Expose number of link up/down transitions
From: David Decotigny Expose the number of times the link has been going UP or DOWN, and update the "carrier_changes" counter to be the sum of these two events. While at it, also update the sysfs-class-net documentation to cover: carrier_changes (3.15), count_link_up (4.16) and count_link_down (4.16) Signed-off-by: David Decotigny [Florian: * rebase * add documentation * merge carrier_changes with up/down counters] Signed-off-by: Florian Fainelli --- Documentation/ABI/testing/sysfs-class-net | 24 include/linux/netdevice.h | 6 -- include/uapi/linux/if_link.h | 2 ++ net/core/net-sysfs.c | 23 ++- net/core/rtnetlink.c | 13 +++-- net/sched/sch_generic.c | 4 ++-- 6 files changed, 65 insertions(+), 7 deletions(-) diff --git a/Documentation/ABI/testing/sysfs-class-net b/Documentation/ABI/testing/sysfs-class-net index 6856da99b6f7..e4b0d5157305 100644 --- a/Documentation/ABI/testing/sysfs-class-net +++ b/Documentation/ABI/testing/sysfs-class-net @@ -259,3 +259,27 @@ Contact: netdev@vger.kernel.org Description: Symbolic link to the PHY device this network device is attached to. 
+ +What: /sys/class/net/carrier_changes)); + atomic_read(&netdev->count_link_up) + + atomic_read(&netdev->count_link_down)); } static DEVICE_ATTR_RO(carrier_changes); +static ssize_t count_link_up_show(struct device *dev, + struct device_attribute *attr, char *buf) +{ + struct net_device *netdev = to_net_dev(dev); + + return sprintf(buf, fmt_dec, atomic_read(&netdev->count_link_up)); +} +static DEVICE_ATTR_RO(count_link_up); + +static ssize_t count_link_down_show(struct device *dev, + struct device_attribute *attr, char *buf) +{ + struct net_device *netdev = to_net_dev(dev); + + return sprintf(buf, fmt_dec, atomic_read(&netdev->count_link_down)); +} +static DEVICE_ATTR_RO(count_link_down); + /* read-write attributes */ static int change_mtu(struct net_device *dev, unsigned long new_mtu) @@ -547,6 +566,8 @@ static struct attribute *net_class_attrs[] __ro_after_init = { &dev_attr_phys_port_name.attr, &dev_attr_phys_switch_id.attr, &dev_attr_proto_down.attr, + &dev_attr_count_link_up.attr, + &dev_attr_count_link_down.attr, NULL, }; ATTRIBUTE_GROUPS(net_class); diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c index 16d644a4f974..68f69a956713 100644 --- a/net/core/rtnetlink.c +++ b/net/core/rtnetlink.c @@ -990,6 +990,8 @@ static noinline size_t if_nlmsg_size(const struct net_device *dev, + nla_total_size(4) /* IFLA_NEW_NETNSID */ + nla_total_size(1) /* IFLA_PROTO_DOWN */ + nla_total_size(4) /* IFLA_IF_NETNSID */ + + nla_total_size(4) /* IFLA_COUNT_LINK_UP */ + + nla_total_size(4) /* IFLA_COUNT_LINK_DOWN */ + 0; } @@ -1551,8 +1553,13 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb, nla_put_string(skb, IFLA_QDISC, dev->qdisc->ops->id)) || nla_put_ifalias(skb, dev) || nla_put_u32(skb, IFLA_CARRIER_CHANGES, - atomic_read(&dev->carrier_changes)) || - nla_put_u8(skb, IFLA_PROTO_DOWN, dev->proto_down)) + atomic_read(&dev->count_link_up) + + atomic_read(&dev->count_link_down)) || + nla_put_u8(skb, IFLA_PROTO_DOWN, dev->proto_down) || + nla_put_u32(skb, 
IFLA_COUNT_LINK_UP, + atomic_read(&dev->count_link_up)) || + nla_put_u32(skb, IFLA_COUNT_LINK_DOWN, + atomic_read(&dev->count_link_down))) goto nla_put_failure; if (event != IFLA_EVENT_NONE) { @@ -1656,6 +1663,8 @@ static const struct nla_policy ifla_policy[IFLA_MAX+1] = { [IFLA_EVENT]= { .type = NLA_U32 }, [IFLA_GROUP]= { .type = NLA_U32 }, [IFLA_IF_NETNSID] = { .type = NLA_S32 }, + [IFLA_COUNT_LINK_UP]= { .type = NLA_U32 }, + [IFLA_COUNT_LINK_DOWN] = { .type = NLA_U32 }, }; static const struct nla_policy ifla_info_policy[IFLA_INFO_MAX+1] = { diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c index ef8b4ecde2ac..28941636afa3 100644 --- a/net/sched/sch_generic.c +++ b/net/sched/sch_generic.c @@ -510,7 +510,7 @@ void netif_carrier_on(struct net_device *dev) if (test_and_clear_bit(__LINK_STATE_NOCARRIER, &dev->state)) { if (dev->reg_state == NETREG_UNINITIALIZED) return; - atomic_inc(&dev->carrier_changes); + atomic_inc(&dev->count_link_up); linkwatch_fire_event(dev); if (netif_running(dev)) __n
[RFC v2 net-next 06/10] net/sched: Introduce the TBS Qdisc
From: Vinicius Costa Gomes

TBS (Time Based Scheduler) uses the information added earlier in this series (the socket option SO_TXTIME and the new role of sk_buff->tstamp) to schedule traffic transmission based on absolute time.

For some workloads, just bandwidth enforcement is not enough, and precise control of the transmission of packets is necessary.

Example:

$ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
     map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0

$ tc qdisc add dev enp2s0 parent 100:1 tbs delta 6 clockid 11 offload 1

In this example, the Qdisc will try to enable offloading ("offload 1") of the control of the transmission time to the network adapter, the socket time stamps are interpreted against clockid '11' (CLOCK_TAI), and packets leave the Qdisc "delta" (6) nanoseconds before their transmission time. When offloading is disabled, the network adapter will ignore the sk_buff time stamp, and so the transmission time will be only "best effort" from the Qdisc.
Signed-off-by: Vinicius Costa Gomes Signed-off-by: Jesus Sanchez-Palencia --- include/linux/netdevice.h | 1 + include/net/pkt_sched.h| 5 + include/uapi/linux/pkt_sched.h | 17 ++ net/sched/Kconfig | 11 ++ net/sched/Makefile | 1 + net/sched/sch_tbs.c| 392 + 6 files changed, 427 insertions(+) create mode 100644 net/sched/sch_tbs.c diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index ed0799a12bf2..e87031bd108e 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -781,6 +781,7 @@ enum tc_setup_type { TC_SETUP_QDISC_CBS, TC_SETUP_QDISC_RED, TC_SETUP_QDISC_PRIO, + TC_SETUP_QDISC_TBS, }; /* These structures hold the attributes of bpf state that are being passed diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h index 2466ea143d01..d042ffda7f21 100644 --- a/include/net/pkt_sched.h +++ b/include/net/pkt_sched.h @@ -155,4 +155,9 @@ struct tc_cbs_qopt_offload { s32 sendslope; }; +struct tc_tbs_qopt_offload { + u8 enable; + s32 queue; +}; + #endif diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h index 37b5096ae97b..6bb39944ba32 100644 --- a/include/uapi/linux/pkt_sched.h +++ b/include/uapi/linux/pkt_sched.h @@ -934,4 +934,21 @@ enum { #define TCA_CBS_MAX (__TCA_CBS_MAX - 1) + +/* TBS */ +struct tc_tbs_qopt { + __u8 offload; + __u8 _pad[3]; + __s32 delta; + __s32 clockid; +}; + +enum { + TCA_TBS_UNSPEC, + TCA_TBS_PARMS, + __TCA_TBS_MAX, +}; + +#define TCA_TBS_MAX (__TCA_TBS_MAX - 1) + #endif diff --git a/net/sched/Kconfig b/net/sched/Kconfig index c03d86a7775e..7d54045995a3 100644 --- a/net/sched/Kconfig +++ b/net/sched/Kconfig @@ -183,6 +183,17 @@ config NET_SCH_CBS To compile this code as a module, choose M here: the module will be called sch_cbs. +config NET_SCH_TBS + tristate "Time Based Scheduler (TBS)" + ---help--- + Say Y here if you want to use the Time Based Scheduler (TBS) packet + scheduling algorithm. + + See the top of for more details. 
+ + To compile this code as a module, choose M here: the + module will be called sch_tbs. + config NET_SCH_GRED tristate "Generic Random Early Detection (GRED)" ---help--- diff --git a/net/sched/Makefile b/net/sched/Makefile index 5b635447e3f8..0f7f29505c89 100644 --- a/net/sched/Makefile +++ b/net/sched/Makefile @@ -54,6 +54,7 @@ obj-$(CONFIG_NET_SCH_FQ) += sch_fq.o obj-$(CONFIG_NET_SCH_HHF) += sch_hhf.o obj-$(CONFIG_NET_SCH_PIE) += sch_pie.o obj-$(CONFIG_NET_SCH_CBS) += sch_cbs.o +obj-$(CONFIG_NET_SCH_TBS) += sch_tbs.o obj-$(CONFIG_NET_CLS_U32) += cls_u32.o obj-$(CONFIG_NET_CLS_ROUTE4) += cls_route.o diff --git a/net/sched/sch_tbs.c b/net/sched/sch_tbs.c new file mode 100644 index ..300456063ac9 --- /dev/null +++ b/net/sched/sch_tbs.c @@ -0,0 +1,392 @@ +/* + * net/sched/sch_tbs.c Time Based Shaper + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version + * 2 of the License, or (at your option) any later version. + * + * Authors:Vinicius Costa Gomes + * Jesus Sanchez-Palencia + * + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +struct tbs_sched_data { + bool offload; + int clockid; + int queue; + s32 delta; /* in ns */ + ktime_t last; /* The txtime of the last skb sent to the netdevice. */ + struct rb_root head; + struct qdisc_watchdog watchdog; + struct Qdisc *qdisc; +}; + +static
[RFC v2 net-next 04/10] net: packet: Hook into time based transmission.
From: Richard Cochran For raw layer-2 packets, copy the desired future transmit time from the CMSG cookie into the skb. Signed-off-by: Richard Cochran Signed-off-by: Jesus Sanchez-Palencia --- net/packet/af_packet.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c index 05d31864a34e..6af129a65e85 100644 --- a/net/packet/af_packet.c +++ b/net/packet/af_packet.c @@ -1976,6 +1976,7 @@ static int packet_sendmsg_spkt(struct socket *sock, struct msghdr *msg, goto out_unlock; } + sockc.transmit_time = 0; sockc.tsflags = sk->sk_tsflags; if (msg->msg_controllen) { err = sock_cmsg_send(sk, msg, &sockc); @@ -1987,6 +1988,7 @@ static int packet_sendmsg_spkt(struct socket *sock, struct msghdr *msg, skb->dev = dev; skb->priority = sk->sk_priority; skb->mark = sk->sk_mark; + skb->tstamp = sockc.transmit_time; sock_tx_timestamp(sk, sockc.tsflags, &skb_shinfo(skb)->tx_flags); @@ -2484,6 +2486,7 @@ static int tpacket_fill_skb(struct packet_sock *po, struct sk_buff *skb, skb->dev = dev; skb->priority = po->sk.sk_priority; skb->mark = po->sk.sk_mark; + skb->tstamp = sockc->transmit_time; sock_tx_timestamp(&po->sk, sockc->tsflags, &skb_shinfo(skb)->tx_flags); skb_shinfo(skb)->destructor_arg = ph.raw; @@ -2660,6 +2663,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg) if (unlikely(!(dev->flags & IFF_UP))) goto out_put; + sockc.transmit_time = 0; sockc.tsflags = po->sk.sk_tsflags; if (msg->msg_controllen) { err = sock_cmsg_send(&po->sk, msg, &sockc); @@ -2856,6 +2860,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len) if (unlikely(!(dev->flags & IFF_UP))) goto out_unlock; + sockc.transmit_time = 0; sockc.tsflags = sk->sk_tsflags; sockc.mark = sk->sk_mark; if (msg->msg_controllen) { @@ -2928,6 +2933,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len) skb->dev = dev; skb->priority = sk->sk_priority; skb->mark = sockc.mark; + skb->tstamp = sockc.transmit_time; if 
(has_vnet_hdr) { err = virtio_net_hdr_to_skb(skb, &vnet_hdr, vio_le()); -- 2.15.1