Re: DPAA Ethernet problems with mainstream Linux kernels

2018-01-17 Thread Christian Zigotzky
Hi Jamie,

Many thanks for your effort. If you need a new kernel for testing please let me 
know.

Cheers,
Christian

Sent from my iPhone

> On 18. Jan 2018, at 02:59, Jamie Krueger  
> wrote:
> 
> Hi Madalin,
> 
> On 01/16/2018 11:33 AM, Madalin-cristian Bucur wrote:
> 
> Here are some results from testing on the X5000/20 with 4.15.0-rc8
> (w/CONFIG_FSL_PAMU disabled):
> 
> In my tests I can now get the interface to obtain an IP Address via DHCP,
> at least *part* of the time that is.
> 
> Here is a link to a screenshot of a Wireshark capture which shows
> a successful DHCP request/response for the X5000/20, followed by some
> failed ping attempts.
> 
> (This traffic was captured on an external machine on the same subnet, and not 
> from the X5000/20 itself)
> http://bitbybitsoftwaregroup.com/share/X5000-DHCP-Answered-Wireshark.png
> 
> Additionally, here is a link to the export of the packet data shown in the 
> above image:
> http://bitbybitsoftwaregroup.com/share/Wireshark-X5000-DHCP-Success.html
> 
> As you can see by this example, even when the interface does obtain an IP 
> address via
> DHCP (about half the time), subsequent traffic still fails to receive 
> responses.
> 
> In this example, I attempt to ping the gateway (192.168.1.1), and one or two 
> other
> local addresses, and nothing pings back (even when trying Skateman's 
> suggestion
> of expanding the buffers for the interface -
> 
> "A workaround to keep the nic alive a bit longer is the following command
> ip link set eth0 qlen 1"
> 
> For reference:
> 
> I have attached the Kernel config that the version I tested was built with.
> Also, I have attached the dmesg output I am currently seeing.
> 
> -- 
> Best Regards,
> 
> Jamie Krueger
> BITbyBIT Software Group LLC
> 
> 
> 



Re: [RFC PATCH net-next 02/12] selftests: forwarding: Add a test for FDB learning

2018-01-17 Thread Jiri Pirko
Thu, Jan 18, 2018 at 08:16:54AM CET, ido...@idosch.org wrote:
>On Wed, Jan 17, 2018 at 04:15:04PM -0800, David Ahern wrote:
>> How about the attached - make ping, ping6 and mz variables? Default to
>> ping and mausezahn but let users update that name if necessary. Comments
>> in the sample config to the effect need to be added to help users.
>
>Yes, looks good to me. Will include the change in the next version.

Ido, please keep in mind that the script should run without the config
file. So the defaults should be in lib.sh.

Thanks!


Re: [RFC PATCH] e1000e: Remove Other from EIAC.

2018-01-17 Thread Benjamin Poirier
On 2018/01/18 15:50, Benjamin Poirier wrote:
> It was reported that emulated e1000e devices in vmware esxi 6.5 Build
> 7526125 do not link up after commit 4aea7a5c5e94 ("e1000e: Avoid receiver
> overrun interrupt bursts", v4.15-rc1). Some tracing shows that after
> e1000e_trigger_lsc() is called, ICR reads out as 0x0 in e1000_msix_other()
> on emulated e1000e devices. In comparison, on real e1000e 82574 hardware,
> icr=0x8004 (_INT_ASSERTED | _OTHER) in the same situation.
 (_INT_ASSERTED | _LSC)

> 
> Some experimentation showed that this flaw in vmware e1000e emulation can
> be worked around by not setting Other in EIAC. This is how it was before
> 16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt", v4.5-rc1).
> 
> Fixes: 4aea7a5c5e94 ("e1000e: Avoid receiver overrun interrupt bursts")
> Signed-off-by: Benjamin Poirier 


Re: [RFC PATCH net-next 00/12] selftests: forwarding: Add VRF-based tests

2018-01-17 Thread Ido Schimmel
Hi Andrew, Florian

On Thu, Jan 18, 2018 at 12:11:11AM +0100, Andrew Lunn wrote:
> > >> However, a similar kind of flexibility can be achieved by using VRFs and
> > >> by looping the switch ports together. For example:
> > >>
> > >>  br0
> > >>   +
> > >>vrf-h1 |   vrf-h2
> > >>  ++---+++
> > >>  ||||
> > >> 192.0.2.1/24 ++++ 192.0.2.2/24
> > >>swp1 swp2 swp3 swp4
> > >>  ++++
> > >>  ||||
> > >>  ++++
> > >>
> 
> > Agreed this is really cool! For DSA enabled switches, we usually have a
> > host that does the test sequencing and then execute commands remotely on
> > the DUT, but we might be able to get such a similar framework up and
> > running on the DUT itself without too much hassle.
> 
> I think the problem we will have is a lack of ports. Most DSA switches
> have 4 or 5 ports. Given the need for two ports per bridge port, we
> will be limited to bridges with just two members. That really limits
> what sort of tests you can do.

I was actually interested in feedback from you guys. Looking at
dsa_slave_changeupper() I see you don't forbid the enslavement to a VRF
and that you set STP state to forwarding when a port leaves a bridge
(good). Does that mean you're able to use some of these tests on your
switches?

The reason we can use these tests for mlxsw is that we support VRF and
ACL offload. At least in the above example, swp4 is able to receive
packets directed at 192.0.2.2 because we program the device with the
host route 192.0.2.2/32.


Re: [RFC PATCH] e1000e: Remove Other from EIAC.

2018-01-17 Thread Benjamin Poirier
On 2018/01/18 15:50, Benjamin Poirier wrote:
> It was reported that emulated e1000e devices in vmware esxi 6.5 Build
> 7526125 do not link up after commit 4aea7a5c5e94 ("e1000e: Avoid receiver
> overrun interrupt bursts", v4.15-rc1). Some tracing shows that after
> e1000e_trigger_lsc() is called, ICR reads out as 0x0 in e1000_msix_other()
> on emulated e1000e devices. In comparison, on real e1000e 82574 hardware,
> icr=0x8004 (_INT_ASSERTED | _OTHER) in the same situation.
> 
> Some experimentation showed that this flaw in vmware e1000e emulation can
> be worked around by not setting Other in EIAC. This is how it was before
> 16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt", v4.5-rc1).

vmware folks, please comment.


Re: AIO operation and CMSG

2018-01-17 Thread Stephan Mueller
Am Mittwoch, 17. Januar 2018, 20:22:13 CET schrieb Christoph Hellwig:

Hi Christoph,

> On Sun, Jan 14, 2018 at 09:01:00AM +0100, Stephan Müller wrote:
> > The syscall io_submit sends data to the kernel and invokes the respective
> > handler function in the kernel such as the recvmsg handler. What I am
> > wondering is whether there is a way to send CMSG data along with the
> > io_submit syscall? If not, is CMSG handling with the AIO syscalls
> > possible at all?
> Not as-is, but it could be easily added by repurposing unused fields
> in the iocb.  The big question is if it should be done to the existing
> IOCB_CMD_PREADV/IOCB_CMD_PWRITEV types, or if new SENDMSG/RECVMSG ones
> should be added instead.

Thank you for the clarification. I think for the moment we have found another 
solution that we are discussing at linux-crypto. Therefore, I would currently 
not see the need for an additional support of CMSG in AIO.

Ciao
Stephan




Re: [RFC PATCH net-next 02/12] selftests: forwarding: Add a test for FDB learning

2018-01-17 Thread Ido Schimmel
On Wed, Jan 17, 2018 at 04:15:04PM -0800, David Ahern wrote:
> How about the attached - make ping, ping6 and mz variables? Default to
> ping and mausezahn but let users update that name if necessary. Comments
> in the sample config to the effect need to be added to help users.

Yes, looks good to me. Will include the change in the next version.

Thanks!


[RFC PATCH] e1000e: Remove Other from EIAC.

2018-01-17 Thread Benjamin Poirier
It was reported that emulated e1000e devices in vmware esxi 6.5 Build
7526125 do not link up after commit 4aea7a5c5e94 ("e1000e: Avoid receiver
overrun interrupt bursts", v4.15-rc1). Some tracing shows that after
e1000e_trigger_lsc() is called, ICR reads out as 0x0 in e1000_msix_other()
on emulated e1000e devices. In comparison, on real e1000e 82574 hardware,
icr=0x8004 (_INT_ASSERTED | _OTHER) in the same situation.

Some experimentation showed that this flaw in vmware e1000e emulation can
be worked around by not setting Other in EIAC. This is how it was before
16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt", v4.5-rc1).

Fixes: 4aea7a5c5e94 ("e1000e: Avoid receiver overrun interrupt bursts")
Signed-off-by: Benjamin Poirier 
---

Jeff, I'm sending as RFC since it looks like a problem that should be fixed
in vmware. If you'd like to have the workaround in e1000e, I'll submit.

---
 drivers/net/ethernet/intel/e1000e/netdev.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c 
b/drivers/net/ethernet/intel/e1000e/netdev.c
index 9f18d39bdc8f..625a4c9a86a4 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -1918,6 +1918,8 @@ static irqreturn_t e1000_msix_other(int __always_unused 
irq, void *data)
bool enable = true;
 
icr = er32(ICR);
+   ew32(ICR, E1000_ICR_OTHER);
+
if (icr & E1000_ICR_RXO) {
ew32(ICR, E1000_ICR_RXO);
enable = false;
@@ -2040,7 +2042,6 @@ static void e1000_configure_msix(struct e1000_adapter 
*adapter)
   hw->hw_addr + E1000_EITR_82574(vector));
else
writel(1, hw->hw_addr + E1000_EITR_82574(vector));
-   adapter->eiac_mask |= E1000_IMS_OTHER;
 
/* Cause Tx interrupts on every write back */
ivar |= BIT(31);
@@ -2265,7 +2266,7 @@ static void e1000_irq_enable(struct e1000_adapter 
*adapter)
 
if (adapter->msix_entries) {
ew32(EIAC_82574, adapter->eiac_mask & E1000_EIAC_MASK_82574);
-   ew32(IMS, adapter->eiac_mask | E1000_IMS_LSC);
+   ew32(IMS, adapter->eiac_mask | E1000_IMS_OTHER | E1000_IMS_LSC);
} else if (hw->mac.type >= e1000_pch_lpt) {
ew32(IMS, IMS_ENABLE_MASK | E1000_IMS_ECCER);
} else {
-- 
2.15.1



[PATCH net] netlink: reset extack earlier in netlink_rcv_skb

2018-01-17 Thread Xin Long
Move up the extack reset/initialization in netlink_rcv_skb, so that
those 'goto ack' will not skip it. Otherwise, later on netlink_ack
may use the uninitialized extack and cause kernel crash.

Fixes: cbbdf8433a5f ("netlink: extack needs to be reset each time through loop")
Reported-by: syzbot+03bee3680a3746677...@syzkaller.appspotmail.com
Signed-off-by: Xin Long 
---
 net/netlink/af_netlink.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 47ef2d8..84a4e4c 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2391,6 +2391,7 @@ int netlink_rcv_skb(struct sk_buff *skb, int (*cb)(struct 
sk_buff *,
while (skb->len >= nlmsg_total_size(0)) {
int msglen;
 
+   memset(&extack, 0, sizeof(extack));
nlh = nlmsg_hdr(skb);
err = 0;
 
@@ -2405,7 +2406,6 @@ int netlink_rcv_skb(struct sk_buff *skb, int (*cb)(struct 
sk_buff *,
if (nlh->nlmsg_type < NLMSG_MIN_TYPE)
goto ack;
 
-   memset(&extack, 0, sizeof(extack));
err = cb(skb, nlh, &extack);
if (err == -EINTR)
goto skip;
-- 
2.1.0



[PATCH net-next 0/5] add some features to hns3 driver

2018-01-17 Thread Peng Li
This patchset adds some features to hns3 driver, include the support
for ethtool command -d, -p and support for manager table.

[Patch 1/5] adds support for ethtool command -d, its ops is get_regs.
driver will send command to command queue, and get regs number and
regs value from command queue.
[Patch 2/5] adds manager table initialization for hardware.
[Patch 3/5 - 4/5] add support for ethtool command -p. Phy will control
leds for copper ports. For fiber ports, driver sends command to
command queue, and IMP will write SGPIO regs to control leds.
[Patch 4/5] adds support for net status led for fiber ports. Net status
include  port speed, total rx/tx packets and link status. Driver send
the status to command queue, and IMP will write SGPIO to control leds.

Fuyun Liang (2):
  net: hns3: add support for get_regs
  net: hns3: add manager table initialization for hardware

Jian Shen (3):
  net: hns3: add ethtool -p support for phy device
  net: hns3: add ethtool -p support for fiber port
  net: hns3: add net status led support for fiber port

 drivers/net/ethernet/hisilicon/hns3/hnae3.h|   5 +-
 drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c |  35 ++
 .../net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h |  47 ++
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c| 533 +
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.h|  14 +
 5 files changed, 633 insertions(+), 1 deletion(-)

-- 
2.9.3



[PATCH net-next 5/5] net: hns3: add net status led support for fiber port

2018-01-17 Thread Peng Li
From: Jian Shen 

Check the net status per second, include port speed, total rx/tx packets
and link status. Updating the led status for fiber port.

Signed-off-by: Jian Shen 
Signed-off-by: Peng Li 
---
 .../net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h |   1 +
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c| 109 +
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.h|   2 +
 3 files changed, 112 insertions(+)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h
index 122f862..3fd10a6 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h
@@ -115,6 +115,7 @@ enum hclge_opcode_type {
HCLGE_OPC_QUERY_LINK_STATUS = 0x0307,
HCLGE_OPC_CONFIG_MAX_FRM_SIZE   = 0x0308,
HCLGE_OPC_CONFIG_SPEED_DUP  = 0x0309,
+   HCLGE_OPC_STATS_MAC_TRAFFIC = 0x0314,
/* MACSEC command */
 
/* PFC/Pause CMD*/
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index 039c90b..f3d9332 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -39,6 +39,7 @@ static int hclge_set_mta_filter_mode(struct hclge_dev *hdev,
 static int hclge_set_mtu(struct hnae3_handle *handle, int new_mtu);
 static int hclge_init_vlan_config(struct hclge_dev *hdev);
 static int hclge_reset_ae_dev(struct hnae3_ae_dev *ae_dev);
+static int hclge_update_led_status(struct hclge_dev *hdev);
 
 static struct hnae3_ae_algo ae_algo;
 
@@ -505,6 +506,38 @@ static int hclge_32_bit_update_stats(struct hclge_dev 
*hdev)
return 0;
 }
 
+static int hclge_mac_get_traffic_stats(struct hclge_dev *hdev)
+{
+   struct hclge_mac_stats *mac_stats = &hdev->hw_stats.mac_stats;
+   struct hclge_desc desc;
+   __le64 *desc_data;
+   int ret;
+
+   /* for fiber port, need to query the total rx/tx packets statstics,
+* used for data transferring checking.
+*/
+   if (hdev->hw.mac.media_type != HNAE3_MEDIA_TYPE_FIBER)
+   return 0;
+
+   if (test_bit(HCLGE_STATE_STATISTICS_UPDATING, &hdev->state))
+   return 0;
+
+   hclge_cmd_setup_basic_desc(&desc, HCLGE_OPC_STATS_MAC_TRAFFIC, true);
+   ret = hclge_cmd_send(&hdev->hw, &desc, 1);
+   if (ret) {
+   dev_err(&hdev->pdev->dev,
+   "Get MAC total pkt stats fail, ret = %d\n", ret);
+
+   return ret;
+   }
+
+   desc_data = (__le64 *)(&desc.data[0]);
+   mac_stats->mac_tx_total_pkt_num += le64_to_cpu(*desc_data++);
+   mac_stats->mac_rx_total_pkt_num += le64_to_cpu(*desc_data);
+
+   return 0;
+}
+
 static int hclge_mac_update_stats(struct hclge_dev *hdev)
 {
 #define HCLGE_MAC_CMD_NUM 21
@@ -2846,13 +2879,20 @@ static void hclge_service_task(struct work_struct *work)
struct hclge_dev *hdev =
container_of(work, struct hclge_dev, service_task);
 
+   /* The total rx/tx packets statstics are wanted to be updated
+* per second. Both hclge_update_stats_for_all() and
+* hclge_mac_get_traffic_stats() can do it.
+*/
if (hdev->hw_stats.stats_timer >= HCLGE_STATS_TIMER_INTERVAL) {
hclge_update_stats_for_all(hdev);
hdev->hw_stats.stats_timer = 0;
+   } else {
+   hclge_mac_get_traffic_stats(hdev);
}
 
hclge_update_speed_duplex(hdev);
hclge_update_link_status(hdev);
+   hclge_update_led_status(hdev);
hclge_service_complete(hdev);
 }
 
@@ -5965,6 +6005,75 @@ static int hclge_set_led_id(struct hnae3_handle *handle,
return ret;
 }
 
+enum hclge_led_port_speed {
+   HCLGE_SPEED_LED_FOR_1G,
+   HCLGE_SPEED_LED_FOR_10G,
+   HCLGE_SPEED_LED_FOR_25G,
+   HCLGE_SPEED_LED_FOR_40G,
+   HCLGE_SPEED_LED_FOR_50G,
+   HCLGE_SPEED_LED_FOR_100G,
+};
+
+static u8 hclge_led_get_speed_status(u32 speed)
+{
+   u8 speed_led;
+
+   switch (speed) {
+   case HCLGE_MAC_SPEED_1G:
+   speed_led = HCLGE_SPEED_LED_FOR_1G;
+   break;
+   case HCLGE_MAC_SPEED_10G:
+   speed_led = HCLGE_SPEED_LED_FOR_10G;
+   break;
+   case HCLGE_MAC_SPEED_25G:
+   speed_led = HCLGE_SPEED_LED_FOR_25G;
+   break;
+   case HCLGE_MAC_SPEED_40G:
+   speed_led = HCLGE_SPEED_LED_FOR_40G;
+   break;
+   case HCLGE_MAC_SPEED_50G:
+   speed_led = HCLGE_SPEED_LED_FOR_50G;
+   break;
+   case HCLGE_MAC_SPEED_100G:
+   speed_led = HCLGE_SPEED_LED_FOR_100G;
+   break;
+   default:
+   speed_led = HCLGE_LED_NO_CHANGE;
+   }
+
+   return speed_led;
+}
+
+static int hclge_update_led_status(struct hclge_dev *hdev)
+{
+  

[PATCH net-next 1/5] net: hns3: add support for get_regs

2018-01-17 Thread Peng Li
From: Fuyun Liang 

This patch adds get_regs support for ethtool cmd.

Signed-off-by: Fuyun Liang 
Signed-off-by: Peng Li 
---
 drivers/net/ethernet/hisilicon/hns3/hnae3.h|   3 +-
 drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c |  23 +++
 .../net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h |   4 +
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c| 176 +
 4 files changed, 205 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hnae3.h 
b/drivers/net/ethernet/hisilicon/hns3/hnae3.h
index 634e932..d104ce5 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hnae3.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hnae3.h
@@ -356,7 +356,8 @@ struct hnae3_ae_ops {
u32 stringset, u8 *data);
int (*get_sset_count)(struct hnae3_handle *handle, int stringset);
 
-   void (*get_regs)(struct hnae3_handle *handle, void *data);
+   void (*get_regs)(struct hnae3_handle *handle, u32 *version,
+void *data);
int (*get_regs_len)(struct hnae3_handle *handle);
 
u32 (*get_rss_key_size)(struct hnae3_handle *handle);
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c
index 358f780..1c8b293 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c
@@ -1063,6 +1063,27 @@ static int hns3_set_coalesce(struct net_device *netdev,
return 0;
 }
 
+static int hns3_get_regs_len(struct net_device *netdev)
+{
+   struct hnae3_handle *h = hns3_get_handle(netdev);
+
+   if (!h->ae_algo->ops->get_regs_len)
+   return -EOPNOTSUPP;
+
+   return h->ae_algo->ops->get_regs_len(h);
+}
+
+static void hns3_get_regs(struct net_device *netdev,
+ struct ethtool_regs *cmd, void *data)
+{
+   struct hnae3_handle *h = hns3_get_handle(netdev);
+
+   if (!h->ae_algo->ops->get_regs)
+   return;
+
+   h->ae_algo->ops->get_regs(h, &cmd->version, data);
+}
+
 static const struct ethtool_ops hns3vf_ethtool_ops = {
.get_drvinfo = hns3_get_drvinfo,
.get_ringparam = hns3_get_ringparam,
@@ -1103,6 +1124,8 @@ static const struct ethtool_ops hns3_ethtool_ops = {
.set_channels = hns3_set_channels,
.get_coalesce = hns3_get_coalesce,
.set_coalesce = hns3_set_coalesce,
+   .get_regs_len = hns3_get_regs_len,
+   .get_regs = hns3_get_regs,
 };
 
 void hns3_ethtool_set_ops(struct net_device *netdev)
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h
index 3c3159b..2561e7a 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h
@@ -102,6 +102,10 @@ enum hclge_opcode_type {
HCLGE_OPC_STATS_64_BIT  = 0x0030,
HCLGE_OPC_STATS_32_BIT  = 0x0031,
HCLGE_OPC_STATS_MAC = 0x0032,
+
+   HCLGE_OPC_QUERY_REG_NUM = 0x0040,
+   HCLGE_OPC_QUERY_32_BIT_REG  = 0x0041,
+   HCLGE_OPC_QUERY_64_BIT_REG  = 0x0042,
/* Device management command */
 
/* MAC commond */
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index 27f0ab6..c3d2cca 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -5544,6 +5544,180 @@ static int hclge_set_channels(struct hnae3_handle 
*handle, u32 new_tqps_num)
return ret;
 }
 
+static int hclge_get_regs_num(struct hclge_dev *hdev, u32 *regs_num_32_bit,
+ u32 *regs_num_64_bit)
+{
+   struct hclge_desc desc;
+   u32 total_num;
+   int ret;
+
+   hclge_cmd_setup_basic_desc(&desc, HCLGE_OPC_QUERY_REG_NUM, true);
+   ret = hclge_cmd_send(&hdev->hw, &desc, 1);
+   if (ret) {
+   dev_err(&hdev->pdev->dev,
+   "Query register number cmd failed, ret = %d.\n", ret);
+   return ret;
+   }
+
+   *regs_num_32_bit = le32_to_cpu(desc.data[0]);
+   *regs_num_64_bit = le32_to_cpu(desc.data[1]);
+
+   total_num = *regs_num_32_bit + *regs_num_64_bit;
+   if (!total_num)
+   return -EINVAL;
+
+   return 0;
+}
+
+static int hclge_get_32_bit_regs(struct hclge_dev *hdev, u32 regs_num,
+void *data)
+{
+#define HCLGE_32_BIT_REG_RTN_DATANUM 8
+
+   struct hclge_desc *desc;
+   u32 *reg_val = data;
+   __le32 *desc_data;
+   int cmd_num;
+   int i, k, n;
+   int ret;
+
+   if (regs_num == 0)
+   return 0;
+
+   cmd_num = DIV_ROUND_UP(regs_num + 2, HCLGE_32_BIT_REG_RTN_DATANUM);
+   desc = kcalloc(cmd_num, sizeof(struct hclge_desc), GFP_KERNEL);
+   if (!desc)
+   return -ENOMEM;
+
+   hc

[PATCH net-next 3/5] net: hns3: add ethtool -p support for phy device

2018-01-17 Thread Peng Li
From: Jian Shen 

Add led location support for phy device. The led will keep blinking
with frequency 2HZ when locating.

Signed-off-by: Jian Shen 
Signed-off-by: Peng Li 
---
 drivers/net/ethernet/hisilicon/hns3/hnae3.h|  2 +
 drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c | 12 +++
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c| 88 ++
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.h| 12 +++
 4 files changed, 114 insertions(+)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hnae3.h 
b/drivers/net/ethernet/hisilicon/hns3/hnae3.h
index d104ce5..fd06bc7 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hnae3.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hnae3.h
@@ -405,6 +405,8 @@ struct hnae3_ae_ops {
int (*set_channels)(struct hnae3_handle *handle, u32 new_tqps_num);
void (*get_flowctrl_adv)(struct hnae3_handle *handle,
 u32 *flowctrl_adv);
+   int (*set_led_id)(struct hnae3_handle *handle,
+ enum ethtool_phys_id_state status);
 };
 
 struct hnae3_dcb_ops {
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c
index 1c8b293..7410205 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c
@@ -1084,6 +1084,17 @@ static void hns3_get_regs(struct net_device *netdev,
h->ae_algo->ops->get_regs(h, &cmd->version, data);
 }
 
+static int hns3_set_phys_id(struct net_device *netdev,
+   enum ethtool_phys_id_state state)
+{
+   struct hnae3_handle *h = hns3_get_handle(netdev);
+
+   if (!h->ae_algo || !h->ae_algo->ops || !h->ae_algo->ops->set_led_id)
+   return -EOPNOTSUPP;
+
+   return h->ae_algo->ops->set_led_id(h, state);
+}
+
 static const struct ethtool_ops hns3vf_ethtool_ops = {
.get_drvinfo = hns3_get_drvinfo,
.get_ringparam = hns3_get_ringparam,
@@ -1126,6 +1137,7 @@ static const struct ethtool_ops hns3_ethtool_ops = {
.set_coalesce = hns3_set_coalesce,
.get_regs_len = hns3_get_regs_len,
.get_regs = hns3_get_regs,
+   .set_phys_id = hns3_set_phys_id,
 };
 
 void hns3_ethtool_set_ops(struct net_device *netdev)
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index 6e64bed..73caf06 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -5819,6 +5819,93 @@ static void hclge_get_regs(struct hnae3_handle *handle, 
u32 *version,
"Get 64 bit register failed, ret = %d.\n", ret);
 }
 
+static int hclge_set_led_status_phy(struct phy_device *phydev, int value)
+{
+   int ret, cur_page;
+
+   mutex_lock(&phydev->lock);
+
+   ret = phy_read(phydev, HCLGE_PHY_PAGE_REG);
+   if (ret < 0)
+   goto out;
+   else
+   cur_page = ret;
+
+   ret = phy_write(phydev, HCLGE_PHY_PAGE_REG, HCLGE_PHY_PAGE_LED);
+   if (ret)
+   goto out;
+
+   ret = phy_write(phydev, HCLGE_LED_FC_REG, value);
+   if (ret)
+   goto out;
+
+   ret = phy_write(phydev, HCLGE_PHY_PAGE_REG, cur_page);
+
+out:
+   mutex_unlock(&phydev->lock);
+   return ret;
+}
+
+static int hclge_get_led_status_phy(struct phy_device *phydev, int *value)
+{
+   int ret, cur_page;
+
+   mutex_lock(&phydev->lock);
+
+   ret = phy_read(phydev, HCLGE_PHY_PAGE_REG);
+   if (ret < 0)
+   goto out;
+   else
+   cur_page = ret;
+
+   ret = phy_write(phydev, HCLGE_PHY_PAGE_REG, HCLGE_PHY_PAGE_LED);
+   if (ret)
+   goto out;
+
+   *value = phy_read(phydev, HCLGE_LED_FC_REG);
+
+   ret = phy_write(phydev, HCLGE_PHY_PAGE_REG, cur_page);
+
+out:
+   mutex_unlock(&phydev->lock);
+   return ret;
+}
+
+static int hclge_set_led_id(struct hnae3_handle *handle,
+   enum ethtool_phys_id_state status)
+{
+#define BLINK_FREQUENCY2
+   struct hclge_vport *vport = hclge_get_vport(handle);
+   struct hclge_dev *hdev = vport->back;
+   struct phy_device *phydev = hdev->hw.mac.phydev;
+   int ret = 0;
+
+   if (!phydev)
+   return -EOPNOTSUPP;
+
+   switch (status) {
+   case ETHTOOL_ID_ACTIVE:
+   ret = hclge_get_led_status_phy(phydev, &hdev->phy_led_val);
+   if (ret)
+   return ret;
+   return BLINK_FREQUENCY;
+   case ETHTOOL_ID_ON:
+   ret = hclge_set_led_status_phy(phydev, HCLGE_LED_FORCE_ON);
+   break;
+   case ETHTOOL_ID_OFF:
+   ret = hclge_set_led_status_phy(phydev, HCLGE_LED_FORCE_OFF);
+   break;
+   case ETHTOOL_ID_INACTIVE:
+   ret = hclge_set_led_status_phy(phydev, hdev->phy_led_val);
+

[PATCH net-next 4/5] net: hns3: add ethtool -p support for fiber port

2018-01-17 Thread Peng Li
From: Jian Shen 

Add led location support for fiber port. The led will keep blinking
when locating.

Signed-off-by: Jian Shen 
Signed-off-by: Peng Li 
---
 .../net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h |  20 
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c| 101 -
 2 files changed, 100 insertions(+), 21 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h
index 1cd28e0..122f862 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h
@@ -227,6 +227,9 @@ enum hclge_opcode_type {
 
/* Mailbox cmd */
HCLGEVF_OPC_MBX_PF_TO_VF= 0x2000,
+
+   /* Led command */
+   HCLGE_OPC_LED_STATUS_CFG= 0xB000,
 };
 
 #define HCLGE_TQP_REG_OFFSET   0x8
@@ -807,6 +810,23 @@ struct hclge_reset_cmd {
 #define HCLGE_NIC_CMQ_DESC_NUM 1024
 #define HCLGE_NIC_CMQ_DESC_NUM_S   3
 
+#define HCLGE_LED_PORT_SPEED_STATE_S   0
+#define HCLGE_LED_PORT_SPEED_STATE_M   GENMASK(5, 0)
+#define HCLGE_LED_ACTIVITY_STATE_S 0
+#define HCLGE_LED_ACTIVITY_STATE_M GENMASK(1, 0)
+#define HCLGE_LED_LINK_STATE_S 0
+#define HCLGE_LED_LINK_STATE_M GENMASK(1, 0)
+#define HCLGE_LED_LOCATE_STATE_S   0
+#define HCLGE_LED_LOCATE_STATE_M   GENMASK(1, 0)
+
+struct hclge_set_led_state_cmd {
+   u8 port_speed_led_config;
+   u8 link_led_config;
+   u8 activity_led_config;
+   u8 locate_led_config;
+   u8 rsv[20];
+};
+
 int hclge_cmd_init(struct hclge_dev *hdev);
 static inline void hclge_write_reg(void __iomem *base, u32 reg, u32 value)
 {
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index 73caf06..039c90b 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -5819,6 +5819,34 @@ static void hclge_get_regs(struct hnae3_handle *handle, 
u32 *version,
"Get 64 bit register failed, ret = %d.\n", ret);
 }
 
+static int hclge_set_led_status_sfp(struct hclge_dev *hdev, u8 
speed_led_status,
+   u8 act_led_status, u8 link_led_status,
+   u8 locate_led_status)
+{
+   struct hclge_set_led_state_cmd *req;
+   struct hclge_desc desc;
+   int ret;
+
+   hclge_cmd_setup_basic_desc(&desc, HCLGE_OPC_LED_STATUS_CFG, false);
+
+   req = (struct hclge_set_led_state_cmd *)desc.data;
+   hnae_set_field(req->port_speed_led_config, HCLGE_LED_PORT_SPEED_STATE_M,
+  HCLGE_LED_PORT_SPEED_STATE_S, speed_led_status);
+   hnae_set_field(req->link_led_config, HCLGE_LED_ACTIVITY_STATE_M,
+  HCLGE_LED_ACTIVITY_STATE_S, act_led_status);
+   hnae_set_field(req->activity_led_config, HCLGE_LED_LINK_STATE_M,
+  HCLGE_LED_LINK_STATE_S, link_led_status);
+   hnae_set_field(req->locate_led_config, HCLGE_LED_LOCATE_STATE_M,
+  HCLGE_LED_LOCATE_STATE_S, locate_led_status);
+
+   ret = hclge_cmd_send(&hdev->hw, &desc, 1);
+   if (ret)
+   dev_err(&hdev->pdev->dev,
+   "Send set led state cmd error, ret =%d\n", ret);
+
+   return ret;
+}
+
 static int hclge_set_led_status_phy(struct phy_device *phydev, int value)
 {
int ret, cur_page;
@@ -5871,6 +5899,12 @@ static int hclge_get_led_status_phy(struct phy_device 
*phydev, int *value)
return ret;
 }
 
+enum hclge_led_status {
+   HCLGE_LED_OFF,
+   HCLGE_LED_ON,
+   HCLGE_LED_NO_CHANGE = 0xFF,
+};
+
 static int hclge_set_led_id(struct hnae3_handle *handle,
enum ethtool_phys_id_state status)
 {
@@ -5880,27 +5914,52 @@ static int hclge_set_led_id(struct hnae3_handle *handle,
struct phy_device *phydev = hdev->hw.mac.phydev;
int ret = 0;
 
-   if (!phydev)
-   return -EOPNOTSUPP;
-
-   switch (status) {
-   case ETHTOOL_ID_ACTIVE:
-   ret = hclge_get_led_status_phy(phydev, &hdev->phy_led_val);
-   if (ret)
-   return ret;
-   return BLINK_FREQUENCY;
-   case ETHTOOL_ID_ON:
-   ret = hclge_set_led_status_phy(phydev, HCLGE_LED_FORCE_ON);
-   break;
-   case ETHTOOL_ID_OFF:
-   ret = hclge_set_led_status_phy(phydev, HCLGE_LED_FORCE_OFF);
-   break;
-   case ETHTOOL_ID_INACTIVE:
-   ret = hclge_set_led_status_phy(phydev, hdev->phy_led_val);
-   break;
-   default:
-   ret = -EINVAL;
-   break;
+   if (phydev) {
+   switch (status) {
+   case ETHTOOL_ID_ACTIVE:
+   ret = hclge_get_led_status_phy(phydev,
+ 

[PATCH net-next 2/5] net: hns3: add manager table initialization for hardware

2018-01-17 Thread Peng Li
From: Fuyun Liang 

The manager table is empty by default. If it is not initialized, the
management pkgs like LLDP will be dropped by hardware. Default entries
need to be added to manager table.

Signed-off-by: Fuyun Liang 
Signed-off-by: Peng Li 
---
 .../net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h |  22 +
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c| 101 +
 2 files changed, 123 insertions(+)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h
index 2561e7a..1cd28e0 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h
@@ -605,6 +605,28 @@ struct hclge_mac_vlan_mask_entry_cmd {
u8 rsv2[14];
 };
 
+#define HCLGE_MAC_MGR_MASK_VLAN_B  BIT(0)
+#define HCLGE_MAC_MGR_MASK_MAC_B   BIT(1)
+#define HCLGE_MAC_MGR_MASK_ETHERTYPE_B BIT(2)
+#define HCLGE_MAC_ETHERTYPE_LLDP   0x88cc
+
+struct hclge_mac_mgr_tbl_entry_cmd {
+   u8  flags;
+   u8  resp_code;
+   __le16  vlan_tag;
+   __le32  mac_addr_hi32;
+   __le16  mac_addr_lo16;
+   __le16  rsv1;
+   __le16  ethter_type;
+   __le16  egress_port;
+   __le16  egress_queue;
+   u8  sw_port_id_aware;
+   u8  rsv2;
+   u8  i_port_bitmap;
+   u8  i_port_direction;
+   u8  rsv3[2];
+};
+
 #define HCLGE_CFG_MTA_MAC_SEL_S0x0
 #define HCLGE_CFG_MTA_MAC_SEL_MGENMASK(1, 0)
 #define HCLGE_CFG_MTA_MAC_EN_B 0x7
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index c3d2cca..6e64bed 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -392,6 +392,16 @@ static const struct hclge_comm_stats_str 
g_mac_stats_string[] = {
HCLGE_MAC_STATS_FIELD_OFF(mac_rx_send_app_bad_pkt_num)}
 };
 
+static const struct hclge_mac_mgr_tbl_entry_cmd hclge_mgr_table[] = {
+   {
+   .flags = HCLGE_MAC_MGR_MASK_VLAN_B,
+   .ethter_type = cpu_to_le16(HCLGE_MAC_ETHERTYPE_LLDP),
+   .mac_addr_hi32 = cpu_to_le32(htonl(0x0180C200)),
+   .mac_addr_lo16 = cpu_to_le16(htons(0x000E)),
+   .i_port_bitmap = 0x1,
+   },
+};
+
 static int hclge_64_bit_update_stats(struct hclge_dev *hdev)
 {
 #define HCLGE_64_BIT_CMD_NUM 5
@@ -4249,6 +4259,91 @@ int hclge_rm_mc_addr_common(struct hclge_vport *vport,
return status;
 }
 
+static int hclge_get_mac_ethertype_cmd_status(struct hclge_dev *hdev,
+ u16 cmdq_resp, u8 resp_code)
+{
+#define HCLGE_ETHERTYPE_SUCCESS_ADD0
+#define HCLGE_ETHERTYPE_ALREADY_ADD1
+#define HCLGE_ETHERTYPE_MGR_TBL_OVERFLOW   2
+#define HCLGE_ETHERTYPE_KEY_CONFLICT   3
+
+   int return_status;
+
+   if (cmdq_resp) {
+   dev_err(&hdev->pdev->dev,
+   "cmdq execute failed for get_mac_ethertype_cmd_status, 
status=%d.\n",
+   cmdq_resp);
+   return -EIO;
+   }
+
+   switch (resp_code) {
+   case HCLGE_ETHERTYPE_SUCCESS_ADD:
+   case HCLGE_ETHERTYPE_ALREADY_ADD:
+   return_status = 0;
+   break;
+   case HCLGE_ETHERTYPE_MGR_TBL_OVERFLOW:
+   dev_err(&hdev->pdev->dev,
+   "add mac ethertype failed for manager table 
overflow.\n");
+   return_status = -EIO;
+   break;
+   case HCLGE_ETHERTYPE_KEY_CONFLICT:
+   dev_err(&hdev->pdev->dev,
+   "add mac ethertype failed for key conflict.\n");
+   return_status = -EIO;
+   break;
+   default:
+   dev_err(&hdev->pdev->dev,
+   "add mac ethertype failed for undefined, code=%d.\n",
+   resp_code);
+   return_status = -EIO;
+   }
+
+   return return_status;
+}
+
+static int hclge_add_mgr_tbl(struct hclge_dev *hdev,
+const struct hclge_mac_mgr_tbl_entry_cmd *req)
+{
+   struct hclge_desc desc;
+   u8 resp_code;
+   u16 retval;
+   int ret;
+
+   hclge_cmd_setup_basic_desc(&desc, HCLGE_OPC_MAC_ETHTYPE_ADD, false);
+   memcpy(desc.data, req, sizeof(struct hclge_mac_mgr_tbl_entry_cmd));
+
+   ret = hclge_cmd_send(&hdev->hw, &desc, 1);
+   if (ret) {
+   dev_err(&hdev->pdev->dev,
+   "add mac ethertype failed for cmd_send, ret =%d.\n",
+   ret);
+   return ret;
+   }
+
+   resp_code = (le32_to_cpu(desc.data[0]) >> 8) & 0xff;
+   retval = le16_to_cpu(desc.retval);
+
+   return hclge_get_mac_ethertype_cmd_status(hdev, retval, resp_code);
+}
+
+static int init_mgr

Re: [PATCH net] gso: validate gso_type if SKB_GSO_DODGY

2018-01-17 Thread Willem de Bruijn
On Thu, Jan 18, 2018 at 12:20 AM, Willem de Bruijn
 wrote:
> On Wed, Jan 17, 2018 at 10:48 PM, Jason Wang  wrote:
>>
>>
>> On 2018年01月18日 07:11, Willem de Bruijn wrote:
>>>
>>> From: Willem de Bruijn
>>>
>>> Validate gso_type of untrusted SKB_GSO_DODGY packets during
>>> segmentation.
>>>
>>> Untrusted user packets are limited to a small set of gso types in
>>> virtio_net_hdr_to_skb. But segmentation occurs on packet contents.
>>> Syzkaller was able to enter gso callbacks that are not hardened
>>> against untrusted user input.
>>>
>>> Fixes: f43798c27684 ("tun: Allow GSO using virtio_net_hdr")
>>
>>
>> This commit is suspicious, I guess it should be 5c7cdf339af5 ("gso: Remove
>> arbitrary checks for unsupported GSO")
>
> The specific SCTP path was introduced with commit 90017accff61 ("sctp:
> add GSO support"). But the main issue that packets can be delivered to
> gso handlers different from their gso_type goes back further.
>
> The commit you reference is actually older than the sctp gso patch, so
> it makes sense that it did not have a check in the sctp_gso_segment.
>
> I still think that we should check in inet_gso_segment when we have
> the proto, instead of in each {tcp, sctp, udp, esp, ...} handler having
> a check of the form.
>
>   !(type & (SKB_GSO_TCPV4 |
> SKB_GSO_TCPV6

Unless we can create packets that legitimate combine
SKB_GSO_DODGY with tunnel headers.

virtio_net_hdr_to_skb does not accept tunneled gso types.

But a tun device can be bridged with a gre tunnel in the
host, creating a path that will call gre_gso_segment.

If that is possible, then this patch is indeed too strict and
we do need checks in the individual handlers.


[PATCH] ixgbe: fix ipv6 support for ipsec offload

2018-01-17 Thread Shannon Nelson
Fix up the Rx path to watch for and decode ipv6 headers that might be
carrying ipsec headers.  To do so, we first change the search function
to be able to take both ipv4 and ipv6 addresses from a pointer, and add
an argument that tells which we are using.  Then in the Rx handler we
add a check for ipv4 vs ipv6 and then parse the headers accordingly.
We can assume simple headers because this device won't decode packets
with vlan or with ipv4/ipv6 extensions.

We also change a flag used in the ...add_sa() function as it seems the
XFRM stack doesn't actually ever set the XFRM_OFFLOAD_IPV6 flag bit.

Signed-off-by: Shannon Nelson 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c | 39 ++
 1 file changed, 27 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
index 587fd8f..93eacdd 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
@@ -379,12 +379,13 @@ static int ixgbe_ipsec_find_empty_idx(struct ixgbe_ipsec 
*ipsec, bool rxtable)
  * @daddr: inbound address to match
  * @proto: protocol to match
  * @spi: SPI to match
+ * @ip4: true if using an ipv4 address
  *
  * Returns a pointer to the matching SA state information
  **/
 static struct xfrm_state *ixgbe_ipsec_find_rx_state(struct ixgbe_ipsec *ipsec,
-   __be32 daddr, u8 proto,
-   __be32 spi)
+   __be32 *daddr, u8 proto,
+   __be32 spi, bool ip4)
 {
struct rx_sa *rsa;
struct xfrm_state *ret = NULL;
@@ -392,7 +393,9 @@ static struct xfrm_state *ixgbe_ipsec_find_rx_state(struct 
ixgbe_ipsec *ipsec,
rcu_read_lock();
hash_for_each_possible_rcu(ipsec->rx_sa_list, rsa, hlist, spi)
if (spi == rsa->xs->id.spi &&
-   daddr == rsa->xs->id.daddr.a4 &&
+   ((ip4 && *daddr == rsa->xs->id.daddr.a4) ||
+ (!ip4 && !memcmp(daddr, &rsa->xs->id.daddr.a6,
+  sizeof(rsa->xs->id.daddr.a6 &&
proto == rsa->xs->id.proto) {
ret = rsa->xs;
xfrm_state_hold(ret);
@@ -505,7 +508,7 @@ static int ixgbe_ipsec_add_sa(struct xfrm_state *xs)
}
 
/* get ip for rx sa table */
-   if (xs->xso.flags & XFRM_OFFLOAD_IPV6)
+   if (xs->props.family == AF_INET6)
memcpy(rsa.ipaddr, &xs->id.daddr.a6, 16);
else
memcpy(&rsa.ipaddr[3], &xs->id.daddr.a4, 4);
@@ -570,7 +573,7 @@ static int ixgbe_ipsec_add_sa(struct xfrm_state *xs)
rsa.mode |= IXGBE_RXMOD_PROTO_ESP;
if (rsa.decrypt)
rsa.mode |= IXGBE_RXMOD_DECRYPT;
-   if (rsa.xs->xso.flags & XFRM_OFFLOAD_IPV6)
+   if (rsa.xs->props.family == AF_INET6)
rsa.mode |= IXGBE_RXMOD_IPV6;
 
/* the preparations worked, so save the info */
@@ -812,18 +815,30 @@ void ixgbe_ipsec_rx(struct ixgbe_ring *rx_ring,
struct ixgbe_ipsec *ipsec = adapter->ipsec;
struct xfrm_offload *xo = NULL;
struct xfrm_state *xs = NULL;
-   struct iphdr *iph;
-   u8 *c_hdr;
+   struct ipv6hdr *ip6 = NULL;
+   struct iphdr *ip4 = NULL;
+   void *daddr;
__be32 spi;
+   u8 *c_hdr;
u8 proto;
 
-   /* we can assume no vlan header in the way, b/c the
+   /* Find the ip and crypto headers in the data.
+* We can assume no vlan header in the way, b/c the
 * hw won't recognize the IPsec packet and anyway the
 * currently vlan device doesn't support xfrm offload.
 */
-   /* TODO: not supporting IPv6 yet */
-   iph = (struct iphdr *)(skb->data + ETH_HLEN);
-   c_hdr = (u8 *)iph + iph->ihl * 4;
+   if (pkt_info & cpu_to_le16(IXGBE_RXDADV_PKTTYPE_IPV4)) {
+   ip4 = (struct iphdr *)(skb->data + ETH_HLEN);
+   daddr = &ip4->daddr;
+   c_hdr = (u8 *)ip4 + ip4->ihl * 4;
+   } else if (pkt_info & cpu_to_le16(IXGBE_RXDADV_PKTTYPE_IPV6)) {
+   ip6 = (struct ipv6hdr *)(skb->data + ETH_HLEN);
+   daddr = &ip6->daddr;
+   c_hdr = (u8 *)ip6 + sizeof(struct ipv6hdr);
+   } else {
+   return;
+   }
+
switch (pkt_info & ipsec_pkt_types) {
case cpu_to_le16(IXGBE_RXDADV_PKTTYPE_IPSEC_AH):
spi = ((struct ip_auth_hdr *)c_hdr)->spi;
@@ -837,7 +852,7 @@ void ixgbe_ipsec_rx(struct ixgbe_ring *rx_ring,
return;
}
 
-   xs = ixgbe_ipsec_find_rx_state(ipsec, iph->daddr, proto, spi);
+   xs = ixgbe_ipsec_find_rx_state(ipsec, daddr, proto, spi, !

Re: [PATCH net] gso: validate gso_type if SKB_GSO_DODGY

2018-01-17 Thread Willem de Bruijn
On Wed, Jan 17, 2018 at 10:48 PM, Jason Wang  wrote:
>
>
> On 2018年01月18日 07:11, Willem de Bruijn wrote:
>>
>> From: Willem de Bruijn
>>
>> Validate gso_type of untrusted SKB_GSO_DODGY packets during
>> segmentation.
>>
>> Untrusted user packets are limited to a small set of gso types in
>> virtio_net_hdr_to_skb. But segmentation occurs on packet contents.
>> Syzkaller was able to enter gso callbacks that are not hardened
>> against untrusted user input.
>>
>> Fixes: f43798c27684 ("tun: Allow GSO using virtio_net_hdr")
>
>
> This commit is suspicious, I guess it should be 5c7cdf339af5 ("gso: Remove
> arbitrary checks for unsupported GSO")

The specific SCTP path was introduced with commit 90017accff61 ("sctp:
add GSO support"). But the main issue that packets can be delivered to
gso handlers different from their gso_type goes back further.

The commit you reference is actually older than the sctp gso patch, so
it makes sense that it did not have a check in the sctp_gso_segment.

I still think that we should check in inet_gso_segment when we have
the proto, instead of in each {tcp, sctp, udp, esp, ...} handler having
a check of the form.

  !(type & (SKB_GSO_TCPV4 |
SKB_GSO_TCPV6

Note that the above commit also only had these checks in the TSO
branch

  if (skb_gso_ok(skb, features | NETIF_F_GSO_ROBUST)) {

but the case where this condition fails has to be protected just as
well, so a revert of that patch + new tests in all handlers added
since is not sufficient.

>> Link:http://lkml.kernel.org/r/<001a1137452496ffc305617e5...@google.com>
>> Reported-by:syzbot+fee64147a25aecd48...@syzkaller.appspotmail.com
>> Signed-off-by: Willem de Bruijn
>> ---
>>   net/ipv4/af_inet.c | 21 +
>>   net/ipv6/ip6_offload.c | 23 ++-
>>   2 files changed, 43 insertions(+), 1 deletion(-)
>>
>> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
>> index f00499a46927..d5a36827f7b1 100644
>> --- a/net/ipv4/af_inet.c
>> +++ b/net/ipv4/af_inet.c
>> @@ -1220,6 +1220,25 @@ int inet_sk_rebuild_header(struct sock *sk)
>>   }
>>   EXPORT_SYMBOL(inet_sk_rebuild_header);
>>   +static bool inet_gso_validate_dodgy(struct sk_buff *skb, int ipproto)
>> +{
>> +   unsigned int gso_type = skb_shinfo(skb)->gso_type;
>> +
>> +   if (gso_type & SKB_GSO_DODGY) {
>> +   switch (gso_type & ~SKB_GSO_DODGY) {
>> +   case SKB_GSO_TCPV4:
>> +   case SKB_GSO_TCPV4 | SKB_GSO_TCP_ECN:
>> +   return ipproto == IPPROTO_TCP;
>> +   case SKB_GSO_UDP:
>> +   return ipproto == IPPROTO_UDP;
>> +   default:
>> +   return false;
>> +   }
>> +   }
>> +
>> +   return true;
>> +}
>
>
> This seems more strict than what was removed by 5c7cdf339af5. Any reason for
> this? I'm asking since this probably work for virito-net but I'm not sure it
> works for other sources.

All sources are limited to the same VIRTIO_NET_HDR_GSO_.. types that
virtio_net_hdr_to_skb supports, as far as I know.


Re: [PATCH net] net: validate untrusted gso packets

2018-01-17 Thread Willem de Bruijn
 virtio_net_hdr_to_skb checks gso_type, but it does not verify that this
 type correctly describes the actual packet. Segmentation happens based
 on packet contents. So a packet was crafted to enter sctp gso, even
 though no such gso_type exists. This issue is not specific to sctp.
>>>
>>>
>>> So it looks to me we should do it in here in sctp_gso_segment().
>>>
>>> if (skb_gso_ok(skb, features | NETIF_F_GSO_ROBUST)) {
>>>  /* Packet is from an untrusted source, reset gso_segs. */
>>
>> No dodgy source can legitimately generate sctp code, so it should not
>> even get there.
>
>
> I think that's why we need extra check for dodgy packets here

Commit 5c7cdf339af5 ("gso: Remove arbitrary checks for unsupported
GSO") only compares gso_type against a list of allowed gso types.

It did not compare gso_type against the actual packet contents, so
would not have protected against this vulnerability.

>, we used to
> have gso_type verification here but was removed. In the future, virtio will
> support passing sctp offload packet down for sure, so we need consider this
> case.

Then the sctp handler will first need to be fixed to not crash on bad packets.
And every other handler for which no virtio gso type exists.

>>   Also, a packet can just as easily spoof an esp packet.
>> See also the discussion in the Link above.
>
>
> Then we probably need check there.

Adding a check in every gso handler that does not expect packets
with SKB_GSO_DODGY source seems backwards to me.

A packet with gso_type SKB_GSO_TCPV4 should never be
delivered to an sctp handler, say. We must check before passing
to a handler whether the packet matches the callback.

>>
>> We can address this specific issue in segmentation by placing a check
>> analogous to skb_validate_dodgy_gso in the network layer. Something
>> like this (but with ECN option support):
>>
>> @@ -1258,6 +1258,22 @@ struct sk_buff *inet_gso_segment(struct sk_buff
>> *skb,
>>
>>  skb_reset_transport_header(skb);
>>
>> +   gso_type = skb_shinfo(skb)->gso_type;
>> +   if (gso_type & SKB_GSO_DODGY) {
>> +   switch (gso_type & (SKB_GSO_TCPV4 | SKB_GSO_UDP)) {
>> +   case SKB_GSO_TCPV4:
>> +   if (proto != IPPROTO_TCP)
>> +   goto out;
>> +   break;
>> +   case SKB_GSO_UDP:
>> +   if (proto != IPPROTO_UDP)
>> +   goto out;
>> +   break;
>> +   default:
>> +   goto out;
>> +   }
>> +   }
>
>
> This implements subset of function for codes which was removed by the commit
> I mentioned below.

No, as I explain above, it performs a different check.

>
>>> And we probably need to recover what has been removed since
>>> 5c7cdf339af560f980b12eb6b0b5aa5f68ac6658 ("gso: Remove arbitrary checks
>>> for
>>> unsupported GSO").
>>>
>> User packets can also have corrupted headers, tripping up segmentation
>> logic that expects sane packets from the trusted protocol stack.
>> Hardening all segmentation paths against all bad packets is error
>> prone and slows down the common path, so validate on kernel entry.
>
>
> I think evil packets should be rare in common case, so I'm not sure
> validate
> it on kernel entry is a good choice especially consider we've already
> had
> header check.

 This just makes that check more strict. Frequency of malicious packets
 is
 not really relevant if a single bad packet can cause damage.
>>>
>>>
>>> We try hard to avoid flow dissector since its overhead is obvious. But
>>> looks
>>> like this patch did it unconditionally, and even for non gso packet.
>>>
 The alternative to validate on kernel entry is to harden the entire
 segmentation
 layer and lower part of the stack. That is much harder to get right and
 not
 necessarily cheaper.
>>>
>>>
>>> For performance reason. I think we should delay the check or segmentation
>>> as
>>> much as possible until it was really needed.
>>
>> Going through segmentation is probably as expensive as flow dissector,
>> if not more so because of the indirect branches.
>
>
> I think we don't even need to care about this consider the evil packet
> should be rare.

How does frequency matter when a single packet can crash a host?

> And what you propose here is just a very small subset of the
> necessary checking, more comes at gso header checking. So even if we care
> performance, it only help for some specific case.

It also fixed the bug that Eric sent a separate patch for, as that did
not dissect as a valid TCP packet, either.

>>   And now we have two
>> pieces of code that need to be hardened against malicious packets.
>
>
> To me fixing the exist path is a good balance between security and
> performance.
>
>>
>> But I did overlook the guest to guest communication, with virtual devices
>> that can

Re: [RFC][PATCH] get rid of the use of set_fs() (by way of kernel_recvmsg()) in sunrpc

2018-01-17 Thread Al Viro
On Wed, Jan 17, 2018 at 07:16:02PM -0800, Linus Torvalds wrote:
> On Wed, Jan 17, 2018 at 7:06 PM, Al Viro  wrote:
> >
> > Similar to the way put_cmsg() handles 32bit case on biarch
> > targets, introduce a flag telling put_cmsg() to treat
> > ->msg_control as kernel pointer, using memcpy instead of
> > copy_to_user().  That allows to avoid the use of kernel_recvmsg()
> > with its set_fs().
> 
> If this is the only case that kernel_recvmsg() exists for, then by all
> means, that patch certainly looks like a good thing.

In -next that's the only remaining caller.  kernel_recvmsg() is
{
mm_segment_t oldfs = get_fs();
int result;

iov_iter_kvec(&msg->msg_iter, READ | ITER_KVEC, vec, num, size);
set_fs(KERNEL_DS);
result = sock_recvmsg(sock, msg, flags);
set_fs(oldfs);
return result;
}
and 
iov_iter_kvec(&msg->msg_iter, READ | ITER_KVEC, vec, num, size);
result = sock_recvmsg(sock, msg, flags);
works just fine for copying the data - that gets handled by copy_to_iter()
and copy_page_to_iter().  Those don't care about set_fs(); the trouble with
sunrpc call site is that we want to fill msg_control-pointed kernel object.
That gets copied by put_cmsg().

We could turn ->msg_control/->msg_controllen into another
iov_iter, but seeing that we never do scatter-gather for those
IMO that would be a massive overkill.  A flag controlling whether
->msg_control is kernel or userland pointer would do, especially
since we already have a flag for "do we want a native or compat
layout for cmsg" in there.

That's the only caller we need it for, but that thing looks cheap
enough.  Obviously needs to pass testing, including "is it too ugly to
live as far as Davem is concerned" test, though...


Re: [PATCH v2 net 0/2] IB/ipoib: ip link support

2018-01-17 Thread Jason Gunthorpe
On Mon, Jan 15, 2018 at 01:13:09PM -0500, David Miller wrote:
> From: Denis Drozdov 
> Date: Tue,  9 Jan 2018 23:42:45 +0200
> 
> > IP link was broken due to the changes in IPoIB for the rdma_netdev
> > support after commit cd565b4b51e5
> > ("IB/IPoIB: Support acceleration options callbacks").
> > 
> > This patchset restores IPoIB pkey creation and removal using rtnetlink.
> > The first patch introduces changes in the rtnetlink code in order to allow
> > IPOIB allocate and free the netdevice.
> > 
> > The second patch establishes appropriate rtnetlink callbacks for IPoIB
> > device and restores IPoIB netlink support
> > 
> > Changes since v1:
> > - Fixed double free_netdev calls in ops->free_link in netdev_run_todo 
> > - Removed priv_size from ipoib_link_ops as not required anymore.
> 
> Please fix your control flow so that the existing netlink op can do
> the right thing.

Okay. Since this bug has been outstanding in RDMA for so long, I've
looked pretty carefully at this..

This patch does go too far, but ipoib does appear to legitimately need
something special here and no amount of 'control flow fixing' in ipoib
will solve it.

The fundamental issue is that ipoib no longer has a generic software
netdev for the child interface. Instead the child interface is bound
directly to one of several *full* hardware drivers. ie the child and
the master are basically exactly the same HW dependent code.

Since the child is a full hardware netdev, it has its own HW driven
needs for priv_size, txqs, rxqs, etc. There is no 'one size fits all'
value like for all the other software based newlink users. It depends
on the HW device installed in the system (eg mlx4, mlx5, hfi1)

The only problem with rtnl newlink is that it wants to use priv_size,
txqs, and rxqs values from a singleton static global structure (eg
ipoib_link_ops) which cannot know in advance which ipoib HW driver is
in use by the specific parent the child is being created for.

This is what needs to be fixed. ipoib simply needs to have
parent-dependent inputs to the alloc_netdev_mqs in rtnl_create_link.

The patch proposes

  struct net_device *(*alloc_link)(struct net *src_net,
   const char *dev_name,
   struct nlattr *tb[]);

To let ipoib allocate the netdev, and then obviosly figure out the
right parameters internally.

This patch also adds some free related stuff, but that seems to be
going too far. ipoib can use the normal free paths. The above is all
that is necessary.

An alternative would be something like:

  struct rtnl_alloc_params {
   size_t priv_size;
   int txqs;
   int rxqs;
  };
  void (*get_alloc_parameters)(struct net_device *net,
   struct rtnl_alloc_params *parms);

To let ipoib tell rtnl_create_link what parameters to use based on
'net' instead of taking them from the global singleton.

Do you see another choice?

Can you accept one of these options?

Thanks,
Jason
(BTW, Doug & I are now co-maintaining RDMA, so I say the above with my
 maintainer hat on. This is a serious userspace visible regression bug
 on our side, and I really want to see it fixed.)


Re: [PATCH net] flow_dissector: properly cap thoff field

2018-01-17 Thread Jason Wang



On 2018年01月18日 06:21, Eric Dumazet wrote:

From: Eric Dumazet 

syzbot reported yet another crash [1] that is caused by
insufficient validation of DODGY packets.

Two bugs are happening here to trigger the crash.

1) Flow dissection leaves with incorrect thoff field.

2) skb_probe_transport_header() sets transport header to this invalid
thoff, even if pointing after skb valid data.

3) qdisc_pkt_len_init() reads out-of-bound data because it
trusts tcp_hdrlen(skb)

Possible fixes :

- Full flow dissector validation before injecting bad DODGY packets in
the stack.
  This approach was attempted here : https://patchwork.ozlabs.org/patch/
861874/

- Have more robust functions in the core.
   This might be needed anyway for stable versions.

This patch fixes the flow dissection issue.

[1]
CPU: 1 PID: 3144 Comm: syzkaller271204 Not tainted 4.15.0-rc4-mm1+ #49
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 
01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:17 [inline]
  dump_stack+0x194/0x257 lib/dump_stack.c:53
  print_address_description+0x73/0x250 mm/kasan/report.c:256
  kasan_report_error mm/kasan/report.c:355 [inline]
  kasan_report+0x23b/0x360 mm/kasan/report.c:413
  __asan_report_load2_noabort+0x14/0x20 mm/kasan/report.c:432
  __tcp_hdrlen include/linux/tcp.h:35 [inline]
  tcp_hdrlen include/linux/tcp.h:40 [inline]
  qdisc_pkt_len_init net/core/dev.c:3160 [inline]
  __dev_queue_xmit+0x20d3/0x2200 net/core/dev.c:3465
  dev_queue_xmit+0x17/0x20 net/core/dev.c:3554
  packet_snd net/packet/af_packet.c:2943 [inline]
  packet_sendmsg+0x3ad5/0x60a0 net/packet/af_packet.c:2968
  sock_sendmsg_nosec net/socket.c:628 [inline]
  sock_sendmsg+0xca/0x110 net/socket.c:638
  sock_write_iter+0x31a/0x5d0 net/socket.c:907
  call_write_iter include/linux/fs.h:1776 [inline]
  new_sync_write fs/read_write.c:469 [inline]
  __vfs_write+0x684/0x970 fs/read_write.c:482
  vfs_write+0x189/0x510 fs/read_write.c:544
  SYSC_write fs/read_write.c:589 [inline]
  SyS_write+0xef/0x220 fs/read_write.c:581
  entry_SYSCALL_64_fastpath+0x1f/0x96

Fixes: 34fad54c2537 ("net: __skb_flow_dissect() must cap its return value")
Fixes: a6e544b0a88b ("flow_dissector: Jump to exit code in __skb_flow_dissect")
Signed-off-by: Eric Dumazet 
Cc: Willem de Bruijn 
Reported-by: syzbot 
---
  net/core/flow_dissector.c |3 +--
  1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index 
15ce300637650e17fcab7e378b20fe7972686d46..544bddf08e13c7f6e47aadc737244c9ba5af56b2
 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -976,8 +976,8 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
  out_good:
ret = true;
  
-	key_control->thoff = (u16)nhoff;

  out:
+   key_control->thoff = min_t(u16, nhoff, skb ? skb->len : hlen);
key_basic->n_proto = proto;
key_basic->ip_proto = ip_proto;
  
@@ -985,7 +985,6 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
  
  out_bad:

ret = false;
-   key_control->thoff = min_t(u16, nhoff, skb ? skb->len : hlen);
goto out;
  }
  EXPORT_SYMBOL(__skb_flow_dissect);



Acked-by: Jason Wang 

I would ask for a CVE for this.

Thanks


Re: [PATCH 2/3] tcp: Add ESP encapsulation support

2018-01-17 Thread Herbert Xu
On Tue, Jan 16, 2018 at 11:28:23AM +0100, Steffen Klassert wrote:
>
> > Is there any chance this can be done with almost no change in TCP
> > stack, using a layer model ? ( net/kcm comes to mind )
> 
> Herbert, would this be an option or is this not possible?

Yes it can be done.  I'm working on it.

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH net] gso: validate gso_type if SKB_GSO_DODGY

2018-01-17 Thread Jason Wang



On 2018年01月18日 07:11, Willem de Bruijn wrote:

From: Willem de Bruijn

Validate gso_type of untrusted SKB_GSO_DODGY packets during
segmentation.

Untrusted user packets are limited to a small set of gso types in
virtio_net_hdr_to_skb. But segmentation occurs on packet contents.
Syzkaller was able to enter gso callbacks that are not hardened
against untrusted user input.

Fixes: f43798c27684 ("tun: Allow GSO using virtio_net_hdr")


This commit is suspicious, I guess it should be 5c7cdf339af5 ("gso: 
Remove arbitrary checks for unsupported GSO")



Link:http://lkml.kernel.org/r/<001a1137452496ffc305617e5...@google.com>
Reported-by:syzbot+fee64147a25aecd48...@syzkaller.appspotmail.com
Signed-off-by: Willem de Bruijn
---
  net/ipv4/af_inet.c | 21 +
  net/ipv6/ip6_offload.c | 23 ++-
  2 files changed, 43 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index f00499a46927..d5a36827f7b1 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1220,6 +1220,25 @@ int inet_sk_rebuild_header(struct sock *sk)
  }
  EXPORT_SYMBOL(inet_sk_rebuild_header);
  
+static bool inet_gso_validate_dodgy(struct sk_buff *skb, int ipproto)

+{
+   unsigned int gso_type = skb_shinfo(skb)->gso_type;
+
+   if (gso_type & SKB_GSO_DODGY) {
+   switch (gso_type & ~SKB_GSO_DODGY) {
+   case SKB_GSO_TCPV4:
+   case SKB_GSO_TCPV4 | SKB_GSO_TCP_ECN:
+   return ipproto == IPPROTO_TCP;
+   case SKB_GSO_UDP:
+   return ipproto == IPPROTO_UDP;
+   default:
+   return false;
+   }
+   }
+
+   return true;
+}


This seems more strict than what was removed by 5c7cdf339af5. Any reason 
for this? I'm asking since this probably work for virito-net but I'm not 
sure it works for other sources.


Thanks


Re: [PATCH v2] r8152: disable RX aggregation on Dell TB16 dock

2018-01-17 Thread Kai Heng Feng

> On 18 Jan 2018, at 11:04 AM, Hayes Wang  wrote:
> 
> [...]
>>> r8153 on Dell TB15/16 dock corrupts rx packets.
>>> 
>>> This change is suggested by Realtek. They guess that the XHCI
>>> controller doesn't have enough buffer, and their guesswork is correct,
>>> once the RX aggregation gets disabled, the issue is gone.
>>> 
>>> ASMedia is currently working on a real sulotion for this issue.
>>> 
>>> Dell and ODM confirm the bcdDevice and iSerialNumber is unique for TB16.
>>> 
>>> Note that TB15 has different bcdDevice and iSerialNumber, which are
>>> not unique values. If you still have TB15, please contact Dell to
>>> replace it with TB16.
> 
> Excuse me. I don't understand why this patch is for specific USB nic rather 
> than xHCI.
> It seems to make the specific USB nic working and the other ones keeping 
> error.

This patch is transient, once ASMedia find the correct solution, the patch
can be reverted.

This patch only targets the builtin 8153 because externally plugged r8152 or
asix do not suffered from this issue.

Kai-Heng

> 
> 
> Best Regards,
> Hayes
> 
> 



Re: [PATCH net] net: validate untrusted gso packets

2018-01-17 Thread Jason Wang



On 2018年01月17日 22:27, Willem de Bruijn wrote:

On Wed, Jan 17, 2018 at 6:54 AM, Jason Wang  wrote:


On 2018年01月17日 12:33, Willem de Bruijn wrote:

On Tue, Jan 16, 2018 at 11:04 PM, Jason Wang  wrote:


On 2018年01月17日 04:29, Willem de Bruijn wrote:

From: Willem de Bruijn

Validate gso packet type and headers on kernel entry. Reuse the info
gathered by skb_probe_transport_header.

Syzbot found two bugs by passing bad gso packets in packet sockets.
Untrusted user packets are limited to a small set of gso types in
virtio_net_hdr_to_skb. But segmentation occurs on packet contents.
Syzkaller was able to enter gso callbacks that are not hardened
against untrusted user input.


Do this mean there's something missed in exist header check for dodgy
packets?

virtio_net_hdr_to_skb checks gso_type, but it does not verify that this
type correctly describes the actual packet. Segmentation happens based
on packet contents. So a packet was crafted to enter sctp gso, even
though no such gso_type exists. This issue is not specific to sctp.


So it looks to me we should do it in here in sctp_gso_segment().

if (skb_gso_ok(skb, features | NETIF_F_GSO_ROBUST)) {
 /* Packet is from an untrusted source, reset gso_segs. */

No dodgy source can legitimately generate sctp code, so it should not
even get there.


I think that's why we need extra check for dodgy packets here, we used 
to have gso_type verification here but was removed. In the future, 
virtio will support passing sctp offload packet down for sure, so we 
need consider this case.



  Also, a packet can just as easily spoof an esp packet.
See also the discussion in the Link above.


Then we probably need check there.



We can address this specific issue in segmentation by placing a check
analogous to skb_validate_dodgy_gso in the network layer. Something
like this (but with ECN option support):

@@ -1258,6 +1258,22 @@ struct sk_buff *inet_gso_segment(struct sk_buff *skb,

 skb_reset_transport_header(skb);

+   gso_type = skb_shinfo(skb)->gso_type;
+   if (gso_type & SKB_GSO_DODGY) {
+   switch (gso_type & (SKB_GSO_TCPV4 | SKB_GSO_UDP)) {
+   case SKB_GSO_TCPV4:
+   if (proto != IPPROTO_TCP)
+   goto out;
+   break;
+   case SKB_GSO_UDP:
+   if (proto != IPPROTO_UDP)
+   goto out;
+   break;
+   default:
+   goto out;
+   }
+   }


This implements subset of function for codes which was removed by the 
commit I mentioned below.



And we probably need to recover what has been removed since
5c7cdf339af560f980b12eb6b0b5aa5f68ac6658 ("gso: Remove arbitrary checks for
unsupported GSO").


User packets can also have corrupted headers, tripping up segmentation
logic that expects sane packets from the trusted protocol stack.
Hardening all segmentation paths against all bad packets is error
prone and slows down the common path, so validate on kernel entry.


I think evil packets should be rare in common case, so I'm not sure
validate
it on kernel entry is a good choice especially consider we've already had
header check.

This just makes that check more strict. Frequency of malicious packets is
not really relevant if a single bad packet can cause damage.


We try hard to avoid flow dissector since its overhead is obvious. But looks
like this patch did it unconditionally, and even for non gso packet.


The alternative to validate on kernel entry is to harden the entire
segmentation
layer and lower part of the stack. That is much harder to get right and
not
necessarily cheaper.


For performance reason. I think we should delay the check or segmentation as
much as possible until it was really needed.

Going through segmentation is probably as expensive as flow dissector,
if not more so because of the indirect branches.


I think we don't even need to care about this consider the evil packet 
should be rare. And what you propose here is just a very small subset of 
the necessary checking, more comes at gso header checking. So even if we 
care performance, it only help for some specific case.



  And now we have two
pieces of code that need to be hardened against malicious packets.


To me fixing the exist path is a good balance between security and 
performance.




But I did overlook the guest to guest communication, with virtual devices
that can set NETIF_F_GSO_ROBUST and so do not enter the stack. That
means that one guest can send misrepresented packets to another. Is that
preferable to absorbing the cost of validation?


The problem is that guests and hosts don't trust each other. Even if one 
packet has already been validated by one side, it must be validated 
again by another side when needed. Doing validation twice is meaningless 
in this case.




The current patch does not actually deprecate the path through the
segmentation la

Re: [RFC][PATCH] get rid of the use of set_fs() (by way of kernel_recvmsg()) in sunrpc

2018-01-17 Thread Linus Torvalds
On Wed, Jan 17, 2018 at 7:06 PM, Al Viro  wrote:
>
> Similar to the way put_cmsg() handles 32bit case on biarch
> targets, introduce a flag telling put_cmsg() to treat
> ->msg_control as kernel pointer, using memcpy instead of
> copy_to_user().  That allows to avoid the use of kernel_recvmsg()
> with its set_fs().

If this is the only case that kernel_recvmsg() exists for, then by all
means, that patch certainly looks like a good thing.

 Linus


[PATCH bpf-next 1/8] bpf: arraymap: move checks out of alloc function

2018-01-17 Thread Jakub Kicinski
Use the new callback to perform allocation checks for array maps.
The fd maps don't need a special allocation callback, they only
need a special check callback.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Quentin Monnet 
---
 kernel/bpf/arraymap.c | 42 --
 1 file changed, 28 insertions(+), 14 deletions(-)

diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index ab94d304a634..68336092bfb4 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -49,27 +49,35 @@ static int bpf_array_alloc_percpu(struct bpf_array *array)
 }
 
 /* Called from syscall */
-static struct bpf_map *array_map_alloc(union bpf_attr *attr)
+static int array_map_alloc_check(union bpf_attr *attr)
 {
bool percpu = attr->map_type == BPF_MAP_TYPE_PERCPU_ARRAY;
int numa_node = bpf_map_attr_numa_node(attr);
-   u32 elem_size, index_mask, max_entries;
-   bool unpriv = !capable(CAP_SYS_ADMIN);
-   struct bpf_array *array;
-   u64 array_size, mask64;
 
/* check sanity of attributes */
if (attr->max_entries == 0 || attr->key_size != 4 ||
attr->value_size == 0 ||
attr->map_flags & ~ARRAY_CREATE_FLAG_MASK ||
(percpu && numa_node != NUMA_NO_NODE))
-   return ERR_PTR(-EINVAL);
+   return -EINVAL;
 
if (attr->value_size > KMALLOC_MAX_SIZE)
/* if value_size is bigger, the user space won't be able to
 * access the elements.
 */
-   return ERR_PTR(-E2BIG);
+   return -E2BIG;
+
+   return 0;
+}
+
+static struct bpf_map *array_map_alloc(union bpf_attr *attr)
+{
+   bool percpu = attr->map_type == BPF_MAP_TYPE_PERCPU_ARRAY;
+   int numa_node = bpf_map_attr_numa_node(attr);
+   u32 elem_size, index_mask, max_entries;
+   bool unpriv = !capable(CAP_SYS_ADMIN);
+   struct bpf_array *array;
+   u64 array_size, mask64;
 
elem_size = round_up(attr->value_size, 8);
 
@@ -327,6 +335,7 @@ static void array_map_free(struct bpf_map *map)
 }
 
 const struct bpf_map_ops array_map_ops = {
+   .map_alloc_check = array_map_alloc_check,
.map_alloc = array_map_alloc,
.map_free = array_map_free,
.map_get_next_key = array_map_get_next_key,
@@ -337,6 +346,7 @@ const struct bpf_map_ops array_map_ops = {
 };
 
 const struct bpf_map_ops percpu_array_map_ops = {
+   .map_alloc_check = array_map_alloc_check,
.map_alloc = array_map_alloc,
.map_free = array_map_free,
.map_get_next_key = array_map_get_next_key,
@@ -345,12 +355,12 @@ const struct bpf_map_ops percpu_array_map_ops = {
.map_delete_elem = array_map_delete_elem,
 };
 
-static struct bpf_map *fd_array_map_alloc(union bpf_attr *attr)
+static int fd_array_map_alloc_check(union bpf_attr *attr)
 {
/* only file descriptors can be stored in this type of map */
if (attr->value_size != sizeof(u32))
-   return ERR_PTR(-EINVAL);
-   return array_map_alloc(attr);
+   return -EINVAL;
+   return array_map_alloc_check(attr);
 }
 
 static void fd_array_map_free(struct bpf_map *map)
@@ -474,7 +484,8 @@ void bpf_fd_array_map_clear(struct bpf_map *map)
 }
 
 const struct bpf_map_ops prog_array_map_ops = {
-   .map_alloc = fd_array_map_alloc,
+   .map_alloc_check = fd_array_map_alloc_check,
+   .map_alloc = array_map_alloc,
.map_free = fd_array_map_free,
.map_get_next_key = array_map_get_next_key,
.map_lookup_elem = fd_array_map_lookup_elem,
@@ -561,7 +572,8 @@ static void perf_event_fd_array_release(struct bpf_map *map,
 }
 
 const struct bpf_map_ops perf_event_array_map_ops = {
-   .map_alloc = fd_array_map_alloc,
+   .map_alloc_check = fd_array_map_alloc_check,
+   .map_alloc = array_map_alloc,
.map_free = fd_array_map_free,
.map_get_next_key = array_map_get_next_key,
.map_lookup_elem = fd_array_map_lookup_elem,
@@ -592,7 +604,8 @@ static void cgroup_fd_array_free(struct bpf_map *map)
 }
 
 const struct bpf_map_ops cgroup_array_map_ops = {
-   .map_alloc = fd_array_map_alloc,
+   .map_alloc_check = fd_array_map_alloc_check,
+   .map_alloc = array_map_alloc,
.map_free = cgroup_fd_array_free,
.map_get_next_key = array_map_get_next_key,
.map_lookup_elem = fd_array_map_lookup_elem,
@@ -610,7 +623,7 @@ static struct bpf_map *array_of_map_alloc(union bpf_attr 
*attr)
if (IS_ERR(inner_map_meta))
return inner_map_meta;
 
-   map = fd_array_map_alloc(attr);
+   map = array_map_alloc(attr);
if (IS_ERR(map)) {
bpf_map_meta_free(inner_map_meta);
return map;
@@ -673,6 +686,7 @@ static u32 array_of_map_gen_lookup(struct bpf_map *map,
 }
 
 const struct bpf_map_ops array_of_maps_map_ops = {
+   .map_alloc_check = fd_array_map_alloc_check,
.map_alloc = array_of_map_alloc,

[PATCH bpf-next 8/8] nfp: bpf: add short busy wait for FW replies

2018-01-17 Thread Jakub Kicinski
Scheduling out and in for every FW message can slow us down
unnecessarily.  Our experiments show that even under heavy load
the FW responds to 99.9% messages within 200 us.  Add a short
busy wait before entering the wait queue.

Signed-off-by: Jakub Kicinski 
---
 drivers/net/ethernet/netronome/nfp/bpf/cmsg.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/netronome/nfp/bpf/cmsg.c 
b/drivers/net/ethernet/netronome/nfp/bpf/cmsg.c
index 71e6586acc36..80d3aa0fc9d3 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/cmsg.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/cmsg.c
@@ -157,7 +157,14 @@ nfp_bpf_cmsg_wait_reply(struct nfp_app_bpf *bpf, enum 
nfp_bpf_cmsg_type type,
int tag)
 {
struct sk_buff *skb;
-   int err;
+   int i, err;
+
+   for (i = 0; i < 50; i++) {
+   udelay(4);
+   skb = nfp_bpf_reply(bpf, tag);
+   if (skb)
+   return skb;
+   }
 
err = wait_event_interruptible_timeout(bpf->cmsg_wq,
   skb = nfp_bpf_reply(bpf, tag),
-- 
2.15.1



[PATCH bpf-next 7/8] selftest/bpf: extend the offload test with map checks

2018-01-17 Thread Jakub Kicinski
Check map device information is reported correctly, and perform
basic map operations.  Check device destruction gets rid of the
maps and map allocation failure path by telling netdevsim to
reject map offload via DebugFS.

Signed-off-by: Jakub Kicinski 
---
 tools/testing/selftests/bpf/Makefile  |   3 +-
 tools/testing/selftests/bpf/sample_map_ret0.c |  34 +
 tools/testing/selftests/bpf/test_offload.py   | 206 +++---
 3 files changed, 218 insertions(+), 25 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/sample_map_ret0.c

diff --git a/tools/testing/selftests/bpf/Makefile 
b/tools/testing/selftests/bpf/Makefile
index a8aa7e251c8e..3a44b655d852 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -19,7 +19,8 @@ TEST_GEN_PROGS = test_verifier test_tag test_maps 
test_lru_map test_lpm_map test
 TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o 
test_obj_id.o \
test_pkt_md_access.o test_xdp_redirect.o test_xdp_meta.o 
sockmap_parse_prog.o \
sockmap_verdict_prog.o dev_cgroup.o sample_ret0.o test_tracepoint.o \
-   test_l4lb_noinline.o test_xdp_noinline.o test_stacktrace_map.o
+   test_l4lb_noinline.o test_xdp_noinline.o test_stacktrace_map.o \
+   sample_map_ret0.o
 
 TEST_PROGS := test_kmod.sh test_xdp_redirect.sh test_xdp_meta.sh \
test_offload.py
diff --git a/tools/testing/selftests/bpf/sample_map_ret0.c 
b/tools/testing/selftests/bpf/sample_map_ret0.c
new file mode 100644
index ..0756303676ac
--- /dev/null
+++ b/tools/testing/selftests/bpf/sample_map_ret0.c
@@ -0,0 +1,34 @@
+/* SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause) */
+#include 
+#include "bpf_helpers.h"
+
+struct bpf_map_def SEC("maps") htab = {
+   .type = BPF_MAP_TYPE_HASH,
+   .key_size = sizeof(__u32),
+   .value_size = sizeof(long),
+   .max_entries = 2,
+};
+
+struct bpf_map_def SEC("maps") array = {
+   .type = BPF_MAP_TYPE_ARRAY,
+   .key_size = sizeof(__u32),
+   .value_size = sizeof(long),
+   .max_entries = 2,
+};
+
+/* Sample program which should always load for testing control paths. */
+SEC(".text") int func()
+{
+   __u64 key64 = 0;
+   __u32 key = 0;
+   long *value;
+
+   value = bpf_map_lookup_elem(&htab, &key);
+   if (!value)
+   return 1;
+   value = bpf_map_lookup_elem(&array, &key64);
+   if (!value)
+   return 1;
+
+   return 0;
+}
diff --git a/tools/testing/selftests/bpf/test_offload.py 
b/tools/testing/selftests/bpf/test_offload.py
index e3c750f17cb8..833b9c1ec450 100755
--- a/tools/testing/selftests/bpf/test_offload.py
+++ b/tools/testing/selftests/bpf/test_offload.py
@@ -20,6 +20,7 @@ import os
 import pprint
 import random
 import string
+import struct
 import subprocess
 import time
 
@@ -156,6 +157,14 @@ netns = [] # net namespaces to be removed
  (len(progs), expected))
 return progs
 
+def bpftool_map_list(expected=None, ns=""):
+_, maps = bpftool("map show", JSON=True, ns=ns, fail=True)
+if expected is not None:
+if len(maps) != expected:
+fail(True, "%d BPF maps loaded, expected %d" %
+ (len(maps), expected))
+return maps
+
 def bpftool_prog_list_wait(expected=0, n_retry=20):
 for i in range(n_retry):
 nprogs = len(bpftool_prog_list())
@@ -164,6 +173,14 @@ netns = [] # net namespaces to be removed
 time.sleep(0.05)
 raise Exception("Time out waiting for program counts to stabilize want %d, 
have %d" % (expected, nprogs))
 
+def bpftool_map_list_wait(expected=0, n_retry=20):
+for i in range(n_retry):
+nmaps = len(bpftool_map_list())
+if nmaps == expected:
+return
+time.sleep(0.05)
+raise Exception("Time out waiting for map counts to stabilize want %d, 
have %d" % (expected, nmaps))
+
 def ip(args, force=False, JSON=True, ns="", fail=True):
 if force:
 args = "-force " + args
@@ -193,6 +210,26 @@ netns = [] # net namespaces to be removed
 return name
 return None
 
+def int2str(fmt, val):
+ret = []
+for b in struct.pack(fmt, val):
+ret.append(int(b))
+return " ".join(map(lambda x: str(x), ret))
+
+def str2int(strtab):
+inttab = []
+for i in strtab:
+inttab.append(int(i, 16))
+ba = bytearray(inttab)
+if len(strtab) == 4:
+fmt = "I"
+elif len(strtab) == 8:
+fmt = "Q"
+else:
+raise Exception("String array of len %d can't be unpacked to an int" %
+(len(strtab)))
+return struct.unpack(fmt, ba)[0]
+
 class DebugfsDir:
 """
 Class for accessing DebugFS directories as a dictionary.
@@ -311,13 +348,13 @@ netns = [] # net namespaces to be removed
 return ip("link set dev %s mtu %d" % (self.dev["ifname"], mtu),
   fail=fail)
 
-def set_xdp(self, bpf, mode, force=False, fail=Tr

[PATCH bpf-next 6/8] netdevsim: bpf: support fake map offload

2018-01-17 Thread Jakub Kicinski
Add to netdevsim ability to pretend it's offloading BPF maps.
We only allow allocation of tiny 2 entry maps, to keep things
simple.  Mutex lock may seem heavy for the operations we
perform, but we want to make sure callbacks can sleep.

Signed-off-by: Jakub Kicinski 
---
 drivers/net/netdevsim/bpf.c   | 246 ++
 drivers/net/netdevsim/netdevsim.h |   3 +
 2 files changed, 249 insertions(+)

diff --git a/drivers/net/netdevsim/bpf.c b/drivers/net/netdevsim/bpf.c
index 5134d5c1306c..b3851bbefad3 100644
--- a/drivers/net/netdevsim/bpf.c
+++ b/drivers/net/netdevsim/bpf.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -31,6 +32,19 @@ struct nsim_bpf_bound_prog {
struct list_head l;
 };
 
+#define NSIM_BPF_MAX_KEYS  2
+
+struct nsim_bpf_bound_map {
+   struct netdevsim *ns;
+   struct bpf_offloaded_map *map;
+   struct mutex mutex;
+   struct nsim_map_entry {
+   void *key;
+   void *value;
+   } entry[NSIM_BPF_MAX_KEYS];
+   struct list_head l;
+};
+
 static int nsim_debugfs_bpf_string_read(struct seq_file *file, void *data)
 {
const char **str = file->private;
@@ -284,6 +298,224 @@ nsim_setup_prog_hw_checks(struct netdevsim *ns, struct 
netdev_bpf *bpf)
return 0;
 }
 
+static bool
+nsim_map_key_match(struct bpf_map *map, struct nsim_map_entry *e, void *key)
+{
+   return e->key && !memcmp(key, e->key, map->key_size);
+}
+
+static int nsim_map_key_find(struct bpf_offloaded_map *offmap, void *key)
+{
+   struct nsim_bpf_bound_map *nmap = offmap->dev_priv;
+   unsigned int i;
+
+   for (i = 0; i < ARRAY_SIZE(nmap->entry); i++)
+   if (nsim_map_key_match(&offmap->map, &nmap->entry[i], key))
+   return i;
+
+   return -ENOENT;
+}
+
+static int
+nsim_map_alloc_elem(struct bpf_offloaded_map *offmap, unsigned int idx)
+{
+   struct nsim_bpf_bound_map *nmap = offmap->dev_priv;
+
+   nmap->entry[idx].key = kmalloc(offmap->map.key_size, GFP_USER);
+   if (!nmap->entry[idx].key)
+   return -ENOMEM;
+   nmap->entry[idx].value = kmalloc(offmap->map.value_size, GFP_USER);
+   if (!nmap->entry[idx].value) {
+   kfree(nmap->entry[idx].key);
+   nmap->entry[idx].key = NULL;
+   return -ENOMEM;
+   }
+
+   return 0;
+}
+
+static int
+nsim_map_get_next_key(struct bpf_offloaded_map *offmap,
+ void *key, void *next_key)
+{
+   struct nsim_bpf_bound_map *nmap = offmap->dev_priv;
+   int idx = -ENOENT;
+
+   mutex_lock(&nmap->mutex);
+
+   if (key)
+   idx = nsim_map_key_find(offmap, key);
+   if (idx == -ENOENT)
+   idx = 0;
+   else
+   idx++;
+
+   for (; idx < ARRAY_SIZE(nmap->entry); idx++) {
+   if (nmap->entry[idx].key) {
+   memcpy(next_key, nmap->entry[idx].key,
+  offmap->map.key_size);
+   break;
+   }
+   }
+
+   mutex_unlock(&nmap->mutex);
+
+   if (idx == ARRAY_SIZE(nmap->entry))
+   return -ENOENT;
+   return 0;
+}
+
+static int
+nsim_map_lookup_elem(struct bpf_offloaded_map *offmap, void *key, void *value)
+{
+   struct nsim_bpf_bound_map *nmap = offmap->dev_priv;
+   int idx;
+
+   mutex_lock(&nmap->mutex);
+
+   idx = nsim_map_key_find(offmap, key);
+   if (idx >= 0)
+   memcpy(value, nmap->entry[idx].value, offmap->map.value_size);
+
+   mutex_unlock(&nmap->mutex);
+
+   return idx < 0 ? idx : 0;
+}
+
+static int
+nsim_map_update_elem(struct bpf_offloaded_map *offmap,
+void *key, void *value, u64 flags)
+{
+   struct nsim_bpf_bound_map *nmap = offmap->dev_priv;
+   int idx, err = 0;
+
+   mutex_lock(&nmap->mutex);
+
+   idx = nsim_map_key_find(offmap, key);
+   if (idx < 0 && flags == BPF_EXIST) {
+   err = idx;
+   goto exit_unlock;
+   }
+   if (idx >= 0 && flags == BPF_NOEXIST) {
+   err = -EEXIST;
+   goto exit_unlock;
+   }
+
+   if (idx < 0) {
+   for (idx = 0; idx < ARRAY_SIZE(nmap->entry); idx++)
+   if (!nmap->entry[idx].key)
+   break;
+   if (idx == ARRAY_SIZE(nmap->entry)) {
+   err = -E2BIG;
+   goto exit_unlock;
+   }
+
+   err = nsim_map_alloc_elem(offmap, idx);
+   if (err)
+   goto exit_unlock;
+   }
+
+   memcpy(nmap->entry[idx].key, key, offmap->map.key_size);
+   memcpy(nmap->entry[idx].value, value, offmap->map.value_size);
+exit_unlock:
+   mutex_unlock(&nmap->mutex);
+
+   return err;
+}
+
+static int nsim_map_delete_elem(struct bpf_offloaded_map *offmap, void *key)
+{
+   stru

[PATCH bpf-next 5/8] tools: bpftool: report device information for offloaded maps

2018-01-17 Thread Jakub Kicinski
Print the information about device on which map is created.

Signed-off-by: Jakub Kicinski 
---
 tools/bpf/bpftool/map.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/tools/bpf/bpftool/map.c b/tools/bpf/bpftool/map.c
index 8d7db9d6b9cd..a152c1a5c94c 100644
--- a/tools/bpf/bpftool/map.c
+++ b/tools/bpf/bpftool/map.c
@@ -428,6 +428,9 @@ static int show_map_close_json(int fd, struct bpf_map_info 
*info)
 
jsonw_name(json_wtr, "flags");
jsonw_printf(json_wtr, "%#x", info->map_flags);
+
+   print_dev_json(info->ifindex, info->netns_dev, info->netns_ino);
+
jsonw_uint_field(json_wtr, "bytes_key", info->key_size);
jsonw_uint_field(json_wtr, "bytes_value", info->value_size);
jsonw_uint_field(json_wtr, "max_entries", info->max_entries);
@@ -469,7 +472,9 @@ static int show_map_close_plain(int fd, struct bpf_map_info 
*info)
if (*info->name)
printf("name %s  ", info->name);
 
-   printf("flags 0x%x\n", info->map_flags);
+   printf("flags 0x%x", info->map_flags);
+   print_dev_plain(info->ifindex, info->netns_dev, info->netns_ino);
+   printf("\n");
printf("\tkey %uB  value %uB  max_entries %u",
   info->key_size, info->value_size, info->max_entries);
 
-- 
2.15.1



[PATCH bpf-next 3/8] bpf: offload: allow array map offload

2018-01-17 Thread Jakub Kicinski
The special handling of different map types is left to the driver.
Allow offload of array maps by simply adding it to accepted types.
For nfp we have to make sure array elements are not deleted.

Signed-off-by: Jakub Kicinski 
---
 drivers/net/ethernet/netronome/nfp/bpf/offload.c | 2 ++
 kernel/bpf/offload.c | 3 ++-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/netronome/nfp/bpf/offload.c 
b/drivers/net/ethernet/netronome/nfp/bpf/offload.c
index c452bf9462e0..1a357aacc444 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/offload.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/offload.c
@@ -176,6 +176,8 @@ nfp_bpf_map_get_next_key(struct bpf_offloaded_map *offmap,
 static int
 nfp_bpf_map_delete_elem(struct bpf_offloaded_map *offmap, void *key)
 {
+   if (offmap->map.map_type == BPF_MAP_TYPE_ARRAY)
+   return -EINVAL;
return nfp_bpf_ctrl_del_entry(offmap, key);
 }
 
diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
index 6c0baa1cf8f8..2657976aec2a 100644
--- a/kernel/bpf/offload.c
+++ b/kernel/bpf/offload.c
@@ -299,7 +299,8 @@ struct bpf_map *bpf_map_offload_map_alloc(union bpf_attr 
*attr)
 
if (!capable(CAP_SYS_ADMIN))
return ERR_PTR(-EPERM);
-   if (attr->map_type != BPF_MAP_TYPE_HASH)
+   if (attr->map_type != BPF_MAP_TYPE_ARRAY &&
+   attr->map_type != BPF_MAP_TYPE_HASH)
return ERR_PTR(-EINVAL);
 
offmap = kzalloc(sizeof(*offmap), GFP_USER);
-- 
2.15.1



[PATCH bpf-next 4/8] bpf: offload: report device information about offloaded maps

2018-01-17 Thread Jakub Kicinski
Tell user space about device on which the map was created.
Unfortunate reality of user ABI makes sharing this code
with program offload difficult but the information is the
same.

Signed-off-by: Jakub Kicinski 
---
 include/linux/bpf.h|  2 ++
 include/uapi/linux/bpf.h   |  3 +++
 kernel/bpf/offload.c   | 55 ++
 kernel/bpf/syscall.c   |  6 +
 tools/include/uapi/linux/bpf.h |  3 +++
 5 files changed, 69 insertions(+)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 025b1c2f8053..66df387106de 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -586,6 +586,8 @@ void bpf_prog_offload_destroy(struct bpf_prog *prog);
 int bpf_prog_offload_info_fill(struct bpf_prog_info *info,
   struct bpf_prog *prog);
 
+int bpf_map_offload_info_fill(struct bpf_map_info *info, struct bpf_map *map);
+
 int bpf_map_offload_lookup_elem(struct bpf_map *map, void *key, void *value);
 int bpf_map_offload_update_elem(struct bpf_map *map,
void *key, void *value, u64 flags);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 7c2259e8bc54..af1f49ad8b88 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -938,6 +938,9 @@ struct bpf_map_info {
__u32 max_entries;
__u32 map_flags;
char  name[BPF_OBJ_NAME_LEN];
+   __u32 ifindex;
+   __u64 netns_dev;
+   __u64 netns_ino;
 } __attribute__((aligned(8)));
 
 /* User bpf_sock_ops struct to access socket values and specify request ops
diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
index 2657976aec2a..c9401075b58c 100644
--- a/kernel/bpf/offload.c
+++ b/kernel/bpf/offload.c
@@ -413,6 +413,61 @@ int bpf_map_offload_get_next_key(struct bpf_map *map, void 
*key, void *next_key)
return ret;
 }
 
+struct ns_get_path_bpf_map_args {
+   struct bpf_offloaded_map *offmap;
+   struct bpf_map_info *info;
+};
+
+static struct ns_common *bpf_map_offload_info_fill_ns(void *private_data)
+{
+   struct ns_get_path_bpf_map_args *args = private_data;
+   struct ns_common *ns;
+   struct net *net;
+
+   rtnl_lock();
+   down_read(&bpf_devs_lock);
+
+   if (args->offmap->netdev) {
+   args->info->ifindex = args->offmap->netdev->ifindex;
+   net = dev_net(args->offmap->netdev);
+   get_net(net);
+   ns = &net->ns;
+   } else {
+   args->info->ifindex = 0;
+   ns = NULL;
+   }
+
+   up_read(&bpf_devs_lock);
+   rtnl_unlock();
+
+   return ns;
+}
+
+int bpf_map_offload_info_fill(struct bpf_map_info *info, struct bpf_map *map)
+{
+   struct ns_get_path_bpf_map_args args = {
+   .offmap = map_to_offmap(map),
+   .info   = info,
+   };
+   struct inode *ns_inode;
+   struct path ns_path;
+   void *res;
+
+   res = ns_get_path_cb(&ns_path, bpf_map_offload_info_fill_ns, &args);
+   if (IS_ERR(res)) {
+   if (!info->ifindex)
+   return -ENODEV;
+   return PTR_ERR(res);
+   }
+
+   ns_inode = ns_path.dentry->d_inode;
+   info->netns_dev = new_encode_dev(ns_inode->i_sb->s_dev);
+   info->netns_ino = ns_inode->i_ino;
+   path_put(&ns_path);
+
+   return 0;
+}
+
 bool bpf_offload_dev_match(struct bpf_prog *prog, struct bpf_map *map)
 {
struct bpf_offloaded_map *offmap;
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index c28524483bf4..abc842aee032 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1799,6 +1799,12 @@ static int bpf_map_get_info_by_fd(struct bpf_map *map,
info.map_flags = map->map_flags;
memcpy(info.name, map->name, sizeof(map->name));
 
+   if (bpf_map_is_dev_bound(map)) {
+   err = bpf_map_offload_info_fill(&info, map);
+   if (err)
+   return err;
+   }
+
if (copy_to_user(uinfo, &info, info_len) ||
put_user(info_len, &uattr->info.info_len))
return -EFAULT;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 69f96af4a569..2d57c73ecacf 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -935,6 +935,9 @@ struct bpf_map_info {
__u32 max_entries;
__u32 map_flags;
char  name[BPF_OBJ_NAME_LEN];
+   __u32 ifindex;
+   __u64 netns_dev;
+   __u64 netns_ino;
 } __attribute__((aligned(8)));
 
 /* User bpf_sock_ops struct to access socket values and specify request ops
-- 
2.15.1



[pull request bpf-next 0/8] bpf: offload: array maps, reporting and test

2018-01-17 Thread Jakub Kicinski
Hi!

This set bring in the rest of map offload code held up by urgent
fixes and improvements to the BPF arrays.

The first 3 patches take care of array map offload, similarly to
hash maps the attribute validation is split out to a separate map
op, and used for both offloaded and non-offloaded case (allocation
only happens if map is on the host).  Offload support comes down
to allowing this map type through the offload check in the core.
NFP driver also rejects the delete operation in case of array maps.

Subsequent patches add reporting of target device in a very similar
way target device of programs is reported (ifindex+netns dev/ino).
Netdevsim is extended with a trivial map implementation allowing us
to test the offload in test_offload.py.

Last patch adds a small busy wait to NFP map IO, this improves the
response times which is especially useful for map dumps.



The following changes since commit e2e3224122e64ebe15fe02a63e8fe09b64a8c743:

  samples/bpf: xdp2skb_meta comment explain why pkt-data pointers are 
invalidated (2018-01-18 01:49:09 +0100)

are available in the Git repository at:

  g...@gitolite.kernel.org:pub/scm/linux/kernel/git/kuba/linux 
bpf-offload-array-maps-and-reporting

for you to fetch changes up to cebe56e2d9efaa883e0ea98b396328ed72fe1c46:

  nfp: bpf: add short busy wait for FW replies (2018-01-17 19:00:12 -0800)


Jakub Kicinski (8):
  bpf: arraymap: move checks out of alloc function
  bpf: arraymap: use bpf_map_init_from_attr()
  bpf: offload: allow array map offload
  bpf: offload: report device information about offloaded maps
  tools: bpftool: report device information for offloaded maps
  netdevsim: bpf: support fake map offload
  selftest/bpf: extend the offload test with map checks
  nfp: bpf: add short busy wait for FW replies

 drivers/net/ethernet/netronome/nfp/bpf/cmsg.c|   9 +-
 drivers/net/ethernet/netronome/nfp/bpf/offload.c |   2 +
 drivers/net/netdevsim/bpf.c  | 246 +++
 drivers/net/netdevsim/netdevsim.h|   3 +
 include/linux/bpf.h  |   2 +
 include/uapi/linux/bpf.h |   3 +
 kernel/bpf/arraymap.c|  49 +++--
 kernel/bpf/offload.c |  58 +-
 kernel/bpf/syscall.c |   6 +
 tools/bpf/bpftool/map.c  |   7 +-
 tools/include/uapi/linux/bpf.h   |   3 +
 tools/testing/selftests/bpf/Makefile |   3 +-
 tools/testing/selftests/bpf/sample_map_ret0.c|  34 
 tools/testing/selftests/bpf/test_offload.py  | 206 ---
 14 files changed, 583 insertions(+), 48 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/sample_map_ret0.c


[PATCH bpf-next 2/8] bpf: arraymap: use bpf_map_init_from_attr()

2018-01-17 Thread Jakub Kicinski
Arraymap was not converted to use bpf_map_init_from_attr()
to avoid merge conflicts with emergency fixes.  Do it now.

Signed-off-by: Jakub Kicinski 
---
 kernel/bpf/arraymap.c | 7 +--
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 68336092bfb4..b1f66480135b 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -120,12 +120,7 @@ static struct bpf_map *array_map_alloc(union bpf_attr 
*attr)
array->map.unpriv_array = unpriv;
 
/* copy mandatory map attributes */
-   array->map.map_type = attr->map_type;
-   array->map.key_size = attr->key_size;
-   array->map.value_size = attr->value_size;
-   array->map.max_entries = attr->max_entries;
-   array->map.map_flags = attr->map_flags;
-   array->map.numa_node = numa_node;
+   bpf_map_init_from_attr(&array->map, attr);
array->elem_size = elem_size;
 
if (!percpu)
-- 
2.15.1



[RFC][PATCH] get rid of the use of set_fs() (by way of kernel_recvmsg()) in sunrpc

2018-01-17 Thread Al Viro
On Wed, Jan 17, 2018 at 06:52:32PM +, Al Viro wrote:

[use of set_fs() by sunrpc]
> We are asking for recvmsg() with zero data length; what we really want is
> ->msg_control.  And _that_ is why we need that set_fs() - we want the damn
> thing to go into local variable.
> 
> But note that filling ->msg_control will happen in put_cmsg(), called
> from ip_cmsg_recv_pktinfo(), called from ip_cmsg_recv_offset(),
> called from udp_recvmsg(), called from sock_recvmsg_nosec(), called
> from sock_recvmsg().  Or in another path in case of IPv6.
> Sure, we can arrange for propagation of that all way down those
> call chains.  My preference would be to try and mark that (one and
> only) case in ->msg_flags, so that put_cmsg() would be able to
> check.  ___sys_recvmsg() sets that as
> msg_sys->msg_flags = flags & (MSG_CMSG_CLOEXEC|MSG_CMSG_COMPAT);
> so we ought to be free to use any bit other than those two.  Since
> put_cmsg() already checks ->msg_flags, that shouldn't put too much
> overhead.

Folks, does anybody have objections against the following:

Get rid of kernel_recvmsg() (and thus set_fs()) use in sunrpc

In net/sunrpc/svcsock.c:svc_udp_recvfrom() we want to get
IP_PKTINFO; currently we stash the address of local variable
into ->msg_control (which normall contains a userland pointer
in recepients) and issue zero-length ->recvmsg() under
set_fs(KERNEL_DS).

Similar to the way put_cmsg() handles 32bit case on biarch
targets, introduce a flag telling put_cmsg() to treat
->msg_control as kernel pointer, using memcpy instead of
copy_to_user().  That allows to avoid the use of kernel_recvmsg()
with its set_fs().

All places that might have non-NULL ->msg_control either pass the
msghdr only to ->sendmsg() and its ilk, or have ->msg_flags
sanitized before passing msghdr anywhere.  IOW, there no
way the new flag to reach put_cmsg() in the mainline kernel,
and after this change it will only be seen in sunrpc case.

Incidentally, all other kernel_recvmsg() users are very easy to
convert to sock_recvmsg(), so that would allow to kill
kernel_recvmsg() off...

Signed-off-by: Al Viro 
---
diff --git a/include/linux/socket.h b/include/linux/socket.h
index 9286a5a8c60c..60947da9c806 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -298,6 +298,7 @@ struct ucred {
 #else
 #define MSG_CMSG_COMPAT0   /* We never have 32 bit fixups 
*/
 #endif
+#define MSG_CMSG_KERNEL0x1000
 
 
 /* Setsockoptions(2) level. Thanks to BSD these must match IPPROTO_xxx */
diff --git a/net/core/scm.c b/net/core/scm.c
index b1ff8a441748..1b73b682e441 100644
--- a/net/core/scm.c
+++ b/net/core/scm.c
@@ -237,10 +237,17 @@ int put_cmsg(struct msghdr * msg, int level, int type, 
int len, void *data)
cmhdr.cmsg_len = cmlen;
 
err = -EFAULT;
-   if (copy_to_user(cm, &cmhdr, sizeof cmhdr))
-   goto out;
-   if (copy_to_user(CMSG_DATA(cm), data, cmlen - sizeof(struct cmsghdr)))
-   goto out;
+   if (unlikely(MSG_CMSG_KERNEL & msg->msg_flags)) {
+   struct cmsghdr *p = msg->msg_control;
+   memcpy(p, &cmhdr, sizeof cmhdr);
+   memcpy(CMSG_DATA(p), data, cmlen - sizeof(struct cmsghdr));
+   } else {
+   if (copy_to_user(cm, &cmhdr, sizeof cmhdr))
+   goto out;
+   if (copy_to_user(CMSG_DATA(cm), data,
+cmlen - sizeof(struct cmsghdr)))
+   goto out;
+   }
cmlen = CMSG_SPACE(len);
if (msg->msg_controllen < cmlen)
cmlen = msg->msg_controllen;
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 5570719e4787..774904f67c93 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -545,7 +545,7 @@ static int svc_udp_recvfrom(struct svc_rqst *rqstp)
.msg_name = svc_addr(rqstp),
.msg_control = cmh,
.msg_controllen = sizeof(buffer),
-   .msg_flags = MSG_DONTWAIT,
+   .msg_flags = MSG_DONTWAIT | MSG_CMSG_KERNEL,
};
size_t len;
int err;
@@ -565,8 +565,7 @@ static int svc_udp_recvfrom(struct svc_rqst *rqstp)
 
clear_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
skb = NULL;
-   err = kernel_recvmsg(svsk->sk_sock, &msg, NULL,
-0, 0, MSG_PEEK | MSG_DONTWAIT);
+   err = sock_recvmsg(svsk->sk_sock, &msg, MSG_PEEK | MSG_DONTWAIT);
if (err >= 0)
skb = skb_recv_udp(svsk->sk_sk, 0, 1, &err);
 


RE: [PATCH v2] r8152: disable RX aggregation on Dell TB16 dock

2018-01-17 Thread Hayes Wang
[...]
> > r8153 on Dell TB15/16 dock corrupts rx packets.
> >
> > This change is suggested by Realtek. They guess that the XHCI
> > controller doesn't have enough buffer, and their guesswork is correct,
> > once the RX aggregation gets disabled, the issue is gone.
> >
> > ASMedia is currently working on a real sulotion for this issue.
> >
> > Dell and ODM confirm the bcdDevice and iSerialNumber is unique for TB16.
> >
> > Note that TB15 has different bcdDevice and iSerialNumber, which are
> > not unique values. If you still have TB15, please contact Dell to
> > replace it with TB16.

Excuse me. I don't understand why this patch is for specific USB nic rather 
than xHCI.
It seems to make the specific USB nic working and the other ones keeping error.


Best Regards,
Hayes




[PATCH net-next 07/12] nfp: add TLV capabilities to the BAR

2018-01-17 Thread Jakub Kicinski
NFP is entirely programmable, including the PCI data interface.
Using a fixed control BAR layout certainly makes implementations
easier, but require careful considerations when space is allocated.
Once BAR area is allocated to one feature nothing else can use it.
Allocating space statically also requires it to be sized upfront,
which leads to either unnecessary limitation or wastage.

We currently have a 32bit capability word defined which tells drivers
which application FW features are supported.   Most of the bits
are exhausted.  The same bits are also reused for enabling specific
features.  Bulk of capabilities don't have a need for an enable bit,
however, leading to confusion and wastage.

TLVs seems like a better fit for expressing capabilities of applications
running on programmable hardware.

This patch leaves the front of the BAR as is, and declares a TLV
capability start at offset 0x58.  Most of the space up to 0x0d90
is already allocated, but the used space can be wrapped with RESERVED
TLVs.  E.g.:

AddressType Length
 0x0058RESERVED  0xe00  /* Wrap basic structures */
 0x0e5cFEATURE_A 0x004
 0x0e64FEATURE_B 0x004
 0x0e6cRESERVED  0x990  /* Wrap qeueue stats */
 0x1800FEATURE_C 0x100

Signed-off-by: Jakub Kicinski 
Reviewed-by: Dirk van der Merwe 
---
 drivers/net/ethernet/netronome/nfp/Makefile|   1 +
 drivers/net/ethernet/netronome/nfp/nfp_net.h   |   3 +
 .../net/ethernet/netronome/nfp/nfp_net_common.c|   5 +
 drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.c  | 113 +
 drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.h  |  67 +++-
 5 files changed, 186 insertions(+), 3 deletions(-)
 create mode 100644 drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.c

diff --git a/drivers/net/ethernet/netronome/nfp/Makefile 
b/drivers/net/ethernet/netronome/nfp/Makefile
index 064f00e23a19..d5866d708dfa 100644
--- a/drivers/net/ethernet/netronome/nfp/Makefile
+++ b/drivers/net/ethernet/netronome/nfp/Makefile
@@ -22,6 +22,7 @@ nfp-objs := \
nfp_hwmon.o \
nfp_main.o \
nfp_net_common.o \
+   nfp_net_ctrl.o \
nfp_net_debugdump.o \
nfp_net_ethtool.o \
nfp_net_main.o \
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net.h 
b/drivers/net/ethernet/netronome/nfp/nfp_net.h
index 6f6e3d6fd935..d88eda9707e6 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net.h
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net.h
@@ -578,6 +578,7 @@ struct nfp_net_dp {
  * @qcp_cfg:Pointer to QCP queue used for configuration 
notification
  * @tx_bar: Pointer to mapped TX queues
  * @rx_bar: Pointer to mapped FL/RX queues
+ * @tlv_caps:  Parsed TLV capabilities
  * @debugfs_dir:   Device directory in debugfs
  * @vnic_list: Entry on device vNIC list
  * @pdev:  Backpointer to PCI device
@@ -644,6 +645,8 @@ struct nfp_net {
u8 __iomem *tx_bar;
u8 __iomem *rx_bar;
 
+   struct nfp_net_tlv_caps tlv_caps;
+
struct dentry *debugfs_dir;
 
struct list_head vnic_list;
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c 
b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
index 2b5cad3069a7..b47da7ff01dd 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
@@ -3815,6 +3815,11 @@ int nfp_net_init(struct nfp_net *nn)
nn->dp.ctrl |= NFP_NET_CFG_CTRL_IRQMOD;
}
 
+   err = nfp_net_tlv_caps_parse(&nn->pdev->dev, nn->dp.ctrl_bar,
+&nn->tlv_caps);
+   if (err)
+   return err;
+
if (nn->dp.netdev)
nfp_net_netdev_init(nn);
 
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.c 
b/drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.c
new file mode 100644
index ..ff155242a665
--- /dev/null
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.c
@@ -0,0 +1,113 @@
+/*
+ * Copyright (C) 2018 Netronome Systems, Inc.
+ *
+ * This software is dual licensed under the GNU General License Version 2,
+ * June 1991 as shown in the file COPYING in the top-level directory of this
+ * source tree or the BSD 2-Clause License provided below.  You have the
+ * option to license this software under the complete terms of either license.
+ *
+ * The BSD 2-Clause License:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ *  1. Redistributions of source code must retain the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer.
+ *
+ *  2. Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer in the documentation and/or other 

[PATCH net-next 06/12] nfp: improve app not found message

2018-01-17 Thread Jakub Kicinski
When driver app matching loaded FW is not found users are faced with:

   nfp: failed to find app with ID 0x%02x

This message does not properly explain that matching driver code is
either not built into the driver or the driver is too old.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Dirk van der Merwe 
---
 drivers/net/ethernet/netronome/nfp/nfp_app.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/netronome/nfp/nfp_app.c 
b/drivers/net/ethernet/netronome/nfp/nfp_app.c
index 8ce65d4276b1..6aedef0ad433 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_app.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_app.c
@@ -124,7 +124,7 @@ struct nfp_app *nfp_app_alloc(struct nfp_pf *pf, enum 
nfp_app_id id)
struct nfp_app *app;
 
if (id >= ARRAY_SIZE(apps) || !apps[id]) {
-   nfp_err(pf->cpp, "failed to find app with ID 0x%02hhx\n", id);
+   nfp_err(pf->cpp, "unknown FW app ID 0x%02hhx, driver too old or 
support for FW not built in\n", id);
return ERR_PTR(-EINVAL);
}
 
-- 
2.15.1



[PATCH net-next 05/12] nfp: protect each repr pointer individually with RCU

2018-01-17 Thread Jakub Kicinski
Representors are grouped in sets by type.  Currently the whole
sets are under RCU protection, but individual representor pointers
are not.  This causes some inconveniences when representors have
to be destroyed, because we have to allocate new sets to remove
any representors.  Protect the individual pointers with RCU.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Dirk van der Merwe 
---
 drivers/net/ethernet/netronome/nfp/flower/main.c  | 45 --
 drivers/net/ethernet/netronome/nfp/nfp_net_repr.c | 71 +++
 drivers/net/ethernet/netronome/nfp/nfp_net_repr.h | 15 +++--
 3 files changed, 68 insertions(+), 63 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/flower/main.c 
b/drivers/net/ethernet/netronome/nfp/flower/main.c
index 67c406815365..3c05b637 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/main.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/main.c
@@ -99,7 +99,7 @@ nfp_flower_repr_get(struct nfp_app *app, u32 port_id)
if (port >= reprs->num_reprs)
return NULL;
 
-   return reprs->reprs[port];
+   return rcu_dereference(reprs->reprs[port]);
 }
 
 static int
@@ -114,15 +114,19 @@ nfp_flower_reprs_reify(struct nfp_app *app, enum 
nfp_repr_type type,
if (!reprs)
return 0;
 
-   for (i = 0; i < reprs->num_reprs; i++)
-   if (reprs->reprs[i]) {
-   struct nfp_repr *repr = netdev_priv(reprs->reprs[i]);
+   for (i = 0; i < reprs->num_reprs; i++) {
+   struct net_device *netdev;
+
+   netdev = nfp_repr_get_locked(app, reprs, i);
+   if (netdev) {
+   struct nfp_repr *repr = netdev_priv(netdev);
 
err = nfp_flower_cmsg_portreify(repr, exists);
if (err)
return err;
count++;
}
+   }
 
return count;
 }
@@ -234,19 +238,21 @@ nfp_flower_spawn_vnic_reprs(struct nfp_app *app,
return -ENOMEM;
 
for (i = 0; i < cnt; i++) {
+   struct net_device *repr;
struct nfp_port *port;
u32 port_id;
 
-   reprs->reprs[i] = nfp_repr_alloc(app);
-   if (!reprs->reprs[i]) {
+   repr = nfp_repr_alloc(app);
+   if (!repr) {
err = -ENOMEM;
goto err_reprs_clean;
}
+   RCU_INIT_POINTER(reprs->reprs[i], repr);
 
/* For now we only support 1 PF */
WARN_ON(repr_type == NFP_REPR_TYPE_PF && i);
 
-   port = nfp_port_alloc(app, port_type, reprs->reprs[i]);
+   port = nfp_port_alloc(app, port_type, repr);
if (repr_type == NFP_REPR_TYPE_PF) {
port->pf_id = i;
port->vnic = priv->nn->dp.ctrl_bar;
@@ -257,11 +263,11 @@ nfp_flower_spawn_vnic_reprs(struct nfp_app *app,
app->pf->vf_cfg_mem + i * NFP_NET_CFG_BAR_SZ;
}
 
-   eth_hw_addr_random(reprs->reprs[i]);
+   eth_hw_addr_random(repr);
 
port_id = nfp_flower_cmsg_pcie_port(nfp_pcie, vnic_type,
i, queue);
-   err = nfp_repr_init(app, reprs->reprs[i],
+   err = nfp_repr_init(app, repr,
port_id, port, priv->nn->dp.netdev);
if (err) {
nfp_port_free(port);
@@ -270,7 +276,7 @@ nfp_flower_spawn_vnic_reprs(struct nfp_app *app,
 
nfp_info(app->cpp, "%s%d Representor(%s) created\n",
 repr_type == NFP_REPR_TYPE_PF ? "PF" : "VF", i,
-reprs->reprs[i]->name);
+repr->name);
}
 
nfp_app_reprs_set(app, repr_type, reprs);
@@ -291,7 +297,7 @@ nfp_flower_spawn_vnic_reprs(struct nfp_app *app,
 err_reprs_remove:
reprs = nfp_app_reprs_set(app, repr_type, NULL);
 err_reprs_clean:
-   nfp_reprs_clean_and_free(reprs);
+   nfp_reprs_clean_and_free(app, reprs);
return err;
 }
 
@@ -329,17 +335,18 @@ nfp_flower_spawn_phy_reprs(struct nfp_app *app, struct 
nfp_flower_priv *priv)
 
for (i = 0; i < eth_tbl->count; i++) {
unsigned int phys_port = eth_tbl->ports[i].index;
+   struct net_device *repr;
struct nfp_port *port;
u32 cmsg_port_id;
 
-   reprs->reprs[phys_port] = nfp_repr_alloc(app);
-   if (!reprs->reprs[phys_port]) {
+   repr = nfp_repr_alloc(app);
+   if (!repr) {
err = -ENOMEM;
goto err_reprs_clean;
}
+   RCU_INIT_POINTER(reprs->reprs[phys_port], repr);
 
-   port = nfp_port_alloc(app, NFP_PORT_PHYS_PORT,
- 

[PATCH net-next 04/12] nfp: add nfp_reprs_get_locked() helper

2018-01-17 Thread Jakub Kicinski
The write side of repr tables is always done under pf->lock.
Add a helper to dereference repr table pointers under protection
of that lock.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Dirk van der Merwe 
---
 drivers/net/ethernet/netronome/nfp/nfp_app.c  | 12 ++--
 drivers/net/ethernet/netronome/nfp/nfp_app.h  |  2 ++
 drivers/net/ethernet/netronome/nfp/nfp_net_repr.c |  4 +---
 3 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/nfp_app.c 
b/drivers/net/ethernet/netronome/nfp/nfp_app.c
index 955a9f44d244..8ce65d4276b1 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_app.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_app.c
@@ -32,6 +32,8 @@
  */
 
 #include 
+#include 
+#include 
 #include 
 #include 
 
@@ -98,14 +100,20 @@ nfp_app_ctrl_msg_alloc(struct nfp_app *app, unsigned int 
size, gfp_t priority)
return skb;
 }
 
+struct nfp_reprs *
+nfp_reprs_get_locked(struct nfp_app *app, enum nfp_repr_type type)
+{
+   return rcu_dereference_protected(app->reprs[type],
+lockdep_is_held(&app->pf->lock));
+}
+
 struct nfp_reprs *
 nfp_app_reprs_set(struct nfp_app *app, enum nfp_repr_type type,
  struct nfp_reprs *reprs)
 {
struct nfp_reprs *old;
 
-   old = rcu_dereference_protected(app->reprs[type],
-   lockdep_is_held(&app->pf->lock));
+   old = nfp_reprs_get_locked(app, type);
rcu_assign_pointer(app->reprs[type], reprs);
 
return old;
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_app.h 
b/drivers/net/ethernet/netronome/nfp/nfp_app.h
index 6a6eb02b516e..91d469a8e3e6 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_app.h
+++ b/drivers/net/ethernet/netronome/nfp/nfp_app.h
@@ -384,6 +384,8 @@ static inline struct net_device *nfp_app_repr_get(struct 
nfp_app *app, u32 id)
 
 struct nfp_app *nfp_app_from_netdev(struct net_device *netdev);
 
+struct nfp_reprs *
+nfp_reprs_get_locked(struct nfp_app *app, enum nfp_repr_type type);
 struct nfp_reprs *
 nfp_app_reprs_set(struct nfp_app *app, enum nfp_repr_type type,
  struct nfp_reprs *reprs);
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_repr.c 
b/drivers/net/ethernet/netronome/nfp/nfp_net_repr.c
index 317f87cc3cc6..e2452176fa23 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_repr.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_repr.c
@@ -424,9 +424,7 @@ int nfp_reprs_resync_phys_ports(struct nfp_app *app)
struct nfp_repr *repr;
int i;
 
-   old_reprs =
-   rcu_dereference_protected(app->reprs[NFP_REPR_TYPE_PHYS_PORT],
- lockdep_is_held(&app->pf->lock));
+   old_reprs = nfp_reprs_get_locked(app, NFP_REPR_TYPE_PHYS_PORT);
if (!old_reprs)
return 0;
 
-- 
2.15.1



[PATCH net-next 09/12] nfp: read mailbox address from TLV caps

2018-01-17 Thread Jakub Kicinski
Allow specifying alternative vNIC mailbox location in TLV caps.
This way we can size the mailbox to the needs and not necessarily
waste 512B of ctrl memory space.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Dirk van der Merwe 
---
 .../net/ethernet/netronome/nfp/nfp_net_common.c| 20 +++--
 drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.c  | 11 ++
 drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.h  | 25 ++
 3 files changed, 46 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c 
b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
index 4218a8660d46..15c2fec4f520 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
@@ -293,9 +293,15 @@ int nfp_net_reconfig(struct nfp_net *nn, u32 update)
  */
 static int nfp_net_reconfig_mbox(struct nfp_net *nn, u32 mbox_cmd)
 {
+   u32 mbox = nn->tlv_caps.mbox_off;
int ret;
 
-   nn_writeq(nn, NFP_NET_CFG_MBOX_CMD, mbox_cmd);
+   if (!nfp_net_has_mbox(&nn->tlv_caps)) {
+   nn_err(nn, "no mailbox present, command: %u\n", mbox_cmd);
+   return -EIO;
+   }
+
+   nn_writeq(nn, mbox + NFP_NET_CFG_MBOX_SIMPLE_CMD, mbox_cmd);
 
ret = nfp_net_reconfig(nn, NFP_NET_CFG_UPDATE_MBOX);
if (ret) {
@@ -303,7 +309,7 @@ static int nfp_net_reconfig_mbox(struct nfp_net *nn, u32 
mbox_cmd)
return ret;
}
 
-   return -nn_readl(nn, NFP_NET_CFG_MBOX_RET);
+   return -nn_readl(nn, mbox + NFP_NET_CFG_MBOX_SIMPLE_RET);
 }
 
 /* Interrupt configuration and handling
@@ -3084,8 +3090,9 @@ nfp_net_vlan_rx_add_vid(struct net_device *netdev, __be16 
proto, u16 vid)
if (!vid)
return 0;
 
-   nn_writew(nn, NFP_NET_CFG_VLAN_FILTER_VID, vid);
-   nn_writew(nn, NFP_NET_CFG_VLAN_FILTER_PROTO, ETH_P_8021Q);
+   nn_writew(nn, nn->tlv_caps.mbox_off + NFP_NET_CFG_VLAN_FILTER_VID, vid);
+   nn_writew(nn, nn->tlv_caps.mbox_off + NFP_NET_CFG_VLAN_FILTER_PROTO,
+ ETH_P_8021Q);
 
return nfp_net_reconfig_mbox(nn, NFP_NET_CFG_MBOX_CMD_CTAG_FILTER_ADD);
 }
@@ -3101,8 +3108,9 @@ nfp_net_vlan_rx_kill_vid(struct net_device *netdev, 
__be16 proto, u16 vid)
if (!vid)
return 0;
 
-   nn_writew(nn, NFP_NET_CFG_VLAN_FILTER_VID, vid);
-   nn_writew(nn, NFP_NET_CFG_VLAN_FILTER_PROTO, ETH_P_8021Q);
+   nn_writew(nn, nn->tlv_caps.mbox_off + NFP_NET_CFG_VLAN_FILTER_VID, vid);
+   nn_writew(nn, nn->tlv_caps.mbox_off + NFP_NET_CFG_VLAN_FILTER_PROTO,
+ ETH_P_8021Q);
 
return nfp_net_reconfig_mbox(nn, NFP_NET_CFG_MBOX_CMD_CTAG_FILTER_KILL);
 }
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.c 
b/drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.c
index 72da1b352418..ffb402746ad4 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.c
@@ -43,6 +43,8 @@ static void nfp_net_tlv_caps_reset(struct nfp_net_tlv_caps 
*caps)
 {
memset(caps, 0, sizeof(*caps));
caps->me_freq_mhz = 1200;
+   caps->mbox_off = NFP_NET_CFG_MBOX_BASE;
+   caps->mbox_len = NFP_NET_CFG_MBOX_VAL_MAX_SZ;
 }
 
 int nfp_net_tlv_caps_parse(struct device *dev, u8 __iomem *ctrl_mem,
@@ -102,6 +104,15 @@ int nfp_net_tlv_caps_parse(struct device *dev, u8 __iomem 
*ctrl_mem,
 
caps->me_freq_mhz = readl(data);
break;
+   case NFP_NET_CFG_TLV_TYPE_MBOX:
+   if (!length) {
+   caps->mbox_off = 0;
+   caps->mbox_len = 0;
+   } else {
+   caps->mbox_off = data - ctrl_mem;
+   caps->mbox_len = length;
+   }
+   break;
default:
if (!FIELD_GET(NFP_NET_CFG_TLV_HEADER_REQUIRED, hdr))
break;
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.h 
b/drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.h
index 4c288cdd0e18..eeecef2caac6 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.h
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.h
@@ -413,11 +413,14 @@
  * 4B used for update command and 4B return code
  * followed by a max of 504B of variable length value
  */
-#define NFP_NET_CFG_MBOX_CMD   0x1800
-#define NFP_NET_CFG_MBOX_RET   0x1804
-#define NFP_NET_CFG_MBOX_VAL   0x1808
+#define NFP_NET_CFG_MBOX_BASE  0x1800
 #define NFP_NET_CFG_MBOX_VAL_MAX_SZ0x1F8
 
+#define NFP_NET_CFG_MBOX_SIMPLE_CMD0x0
+#define NFP_NET_CFG_MBOX_SIMPLE_RET0x4
+#define NFP_NET_CFG_MBOX_SIMPLE_VAL0x8
+#define NFP_NET_CFG_MBOX_SIMPLE_LEN0x12
+
 #define NFP_NET_CFG_MBOX_CMD_CTAG_FILTER_ADD 1
 #define NFP_NET_CFG_MBOX_CMD_CTAG_FILTER_KILL 2
 
@@ -428,

[PATCH net-next 02/12] nfp: release global resources only on the remove path

2018-01-17 Thread Jakub Kicinski
NFP app is currently shut down as soon as all the vNICs are gone.
This means we can't depend on the app existing throughout the
lifetime of the device.  Free the app only from PCI remove path.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Dirk van der Merwe 
---
 drivers/net/ethernet/netronome/nfp/nfp_net_main.c | 36 ---
 1 file changed, 13 insertions(+), 23 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_main.c 
b/drivers/net/ethernet/netronome/nfp/nfp_net_main.c
index c505014121c4..e4f4aa5c298e 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_main.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_main.c
@@ -570,17 +570,6 @@ static int nfp_net_pci_map_mem(struct nfp_pf *pf)
return err;
 }
 
-static void nfp_net_pci_remove_finish(struct nfp_pf *pf)
-{
-   nfp_net_pf_app_stop(pf);
-   /* stop app first, to avoid double free of ctrl vNIC's ddir */
-   nfp_net_debugfs_dir_clean(&pf->ddir);
-
-   nfp_net_pf_free_irqs(pf);
-   nfp_net_pf_app_clean(pf);
-   nfp_net_pci_unmap_mem(pf);
-}
-
 static int
 nfp_net_eth_port_update(struct nfp_cpp *cpp, struct nfp_port *port,
struct nfp_eth_table *eth_table)
@@ -655,9 +644,6 @@ int nfp_net_refresh_port_table_sync(struct nfp_pf *pf)
nfp_net_pf_free_vnic(pf, nn);
}
 
-   if (list_empty(&pf->vnics))
-   nfp_net_pci_remove_finish(pf);
-
return 0;
 }
 
@@ -810,20 +796,24 @@ int nfp_net_pci_probe(struct nfp_pf *pf)
 
 void nfp_net_pci_remove(struct nfp_pf *pf)
 {
-   struct nfp_net *nn;
+   struct nfp_net *nn, *next;
 
mutex_lock(&pf->lock);
-   if (list_empty(&pf->vnics))
-   goto out;
+   list_for_each_entry_safe(nn, next, &pf->vnics, vnic_list) {
+   if (!nfp_net_is_data_vnic(nn))
+   continue;
+   nfp_net_pf_clean_vnic(pf, nn);
+   nfp_net_pf_free_vnic(pf, nn);
+   }
 
-   list_for_each_entry(nn, &pf->vnics, vnic_list)
-   if (nfp_net_is_data_vnic(nn))
-   nfp_net_pf_clean_vnic(pf, nn);
+   nfp_net_pf_app_stop(pf);
+   /* stop app first, to avoid double free of ctrl vNIC's ddir */
+   nfp_net_debugfs_dir_clean(&pf->ddir);
 
-   nfp_net_pf_free_vnics(pf);
+   nfp_net_pf_free_irqs(pf);
+   nfp_net_pf_app_clean(pf);
+   nfp_net_pci_unmap_mem(pf);
 
-   nfp_net_pci_remove_finish(pf);
-out:
mutex_unlock(&pf->lock);
 
cancel_work_sync(&pf->port_refresh_work);
-- 
2.15.1



[PATCH net-next 11/12] nfp: allow apps to disable ctrl vNIC capabilities

2018-01-17 Thread Jakub Kicinski
Most vNIC capabilities are netdev related.  It makes no sense
to initialize them and waste FW resources.  Some are even
counter-productive, like IRQ moderation, which will slow
down exchange of control messages.

Add to nfp_app a mask of enabled control vNIC capabilities
for apps to use.  Make flower and BPF enable all capabilities
for now.  No functional changes.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Dirk van der Merwe 
---
 drivers/net/ethernet/netronome/nfp/bpf/main.c   | 2 ++
 drivers/net/ethernet/netronome/nfp/flower/main.c| 2 ++
 drivers/net/ethernet/netronome/nfp/nfp_app.h| 4 
 drivers/net/ethernet/netronome/nfp/nfp_net_common.c | 4 
 4 files changed, 12 insertions(+)

diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.c 
b/drivers/net/ethernet/netronome/nfp/bpf/main.c
index 8823c8360047..5f021d0c88a4 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/main.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/main.c
@@ -389,6 +389,8 @@ const struct nfp_app_type app_bpf = {
.id = NFP_APP_BPF_NIC,
.name   = "ebpf",
 
+   .ctrl_cap_mask  = ~0U,
+
.init   = nfp_bpf_init,
.clean  = nfp_bpf_clean,
 
diff --git a/drivers/net/ethernet/netronome/nfp/flower/main.c 
b/drivers/net/ethernet/netronome/nfp/flower/main.c
index 3c05b637..742d6f1575b5 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/main.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/main.c
@@ -565,6 +565,8 @@ static void nfp_flower_stop(struct nfp_app *app)
 const struct nfp_app_type app_flower = {
.id = NFP_APP_FLOWER_NIC,
.name   = "flower",
+
+   .ctrl_cap_mask  = ~0U,
.ctrl_has_meta  = true,
 
.extra_cap  = nfp_flower_extra_cap,
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_app.h 
b/drivers/net/ethernet/netronome/nfp/nfp_app.h
index 91d469a8e3e6..7e474df90598 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_app.h
+++ b/drivers/net/ethernet/netronome/nfp/nfp_app.h
@@ -66,6 +66,9 @@ extern const struct nfp_app_type app_flower;
  * struct nfp_app_type - application definition
  * @id:application ID
  * @name:  application name
+ * @ctrl_cap_mask:  ctrl vNIC capability mask, allows disabling features like
+ * IRQMOD which are on by default but counter-productive for
+ * control messages which are often latency-sensitive
  * @ctrl_has_meta:  control messages have prepend of type:5/port:CTRL
  *
  * Callbacks
@@ -100,6 +103,7 @@ struct nfp_app_type {
enum nfp_app_id id;
const char *name;
 
+   u32 ctrl_cap_mask;
bool ctrl_has_meta;
 
int (*init)(struct nfp_app *app);
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c 
b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
index 86a88770181d..cdf52421eaca 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
@@ -3790,6 +3790,10 @@ static int nfp_net_read_caps(struct nfp_net *nn)
nn->dp.rx_offset = NFP_NET_RX_OFFSET;
}
 
+   /* For control vNICs mask out the capabilities app doesn't want. */
+   if (!nn->dp.netdev)
+   nn->cap &= nn->app->type->ctrl_cap_mask;
+
return 0;
 }
 
-- 
2.15.1



[PATCH net-next 10/12] nfp: split reading capabilities out of nfp_net_init()

2018-01-17 Thread Jakub Kicinski
nfp_net_init() is a little long and we are about to add more
code to reading capabilties.  Move the capability reading,
parsing and validating out.  Only actual initialization
will stay in nfp_net_init().

No functional changes.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Dirk van der Merwe 
---
 .../net/ethernet/netronome/nfp/nfp_net_common.c| 31 ++
 1 file changed, 20 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c 
b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
index 15c2fec4f520..86a88770181d 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
@@ -3756,18 +3756,8 @@ static void nfp_net_netdev_init(struct nfp_net *nn)
nfp_net_set_ethtool_ops(netdev);
 }
 
-/**
- * nfp_net_init() - Initialise/finalise the nfp_net structure
- * @nn:NFP Net device structure
- *
- * Return: 0 on success or negative errno on error.
- */
-int nfp_net_init(struct nfp_net *nn)
+static int nfp_net_read_caps(struct nfp_net *nn)
 {
-   int err;
-
-   nn->dp.rx_dma_dir = DMA_FROM_DEVICE;
-
/* Get some of the read-only fields from the BAR */
nn->cap = nn_readl(nn, NFP_NET_CFG_CAP);
nn->max_mtu = nn_readl(nn, NFP_NET_CFG_MAX_MTU);
@@ -3800,6 +3790,25 @@ int nfp_net_init(struct nfp_net *nn)
nn->dp.rx_offset = NFP_NET_RX_OFFSET;
}
 
+   return 0;
+}
+
+/**
+ * nfp_net_init() - Initialise/finalise the nfp_net structure
+ * @nn:NFP Net device structure
+ *
+ * Return: 0 on success or negative errno on error.
+ */
+int nfp_net_init(struct nfp_net *nn)
+{
+   int err;
+
+   nn->dp.rx_dma_dir = DMA_FROM_DEVICE;
+
+   err = nfp_net_read_caps(nn);
+   if (err)
+   return err;
+
/* Set default MTU and Freelist buffer size */
if (nn->max_mtu < NFP_NET_DEFAULT_MTU)
nn->dp.mtu = nn->max_mtu;
-- 
2.15.1



[PATCH net-next 08/12] nfp: read ME frequency from vNIC ctrl memory

2018-01-17 Thread Jakub Kicinski
PCIe island clock frequency is used when converting coalescing
parameters from usecs to NFP timestamps.  Most chips don't run
at 1200MHz, allow FW to provide us with the real frequency.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Dirk van der Merwe 
---
 drivers/net/ethernet/netronome/nfp/nfp_net_common.c |  2 +-
 drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.c   | 11 +++
 drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.h   |  7 +++
 drivers/net/ethernet/netronome/nfp/nfp_net_main.c   |  6 --
 drivers/net/ethernet/netronome/nfp/nfp_netvf_main.c |  6 --
 5 files changed, 19 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c 
b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
index b47da7ff01dd..4218a8660d46 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
@@ -2458,7 +2458,7 @@ void nfp_net_coalesce_write_cfg(struct nfp_net *nn)
 * ME timestamp ticks.  There are 16 ME clock cycles for each timestamp
 * count.
 */
-   factor = nn->me_freq_mhz / 16;
+   factor = nn->tlv_caps.me_freq_mhz / 16;
 
/* copy RX interrupt coalesce parameters */
value = (nn->rx_coalesce_max_frames << 16) |
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.c 
b/drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.c
index ff155242a665..72da1b352418 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.c
@@ -42,6 +42,7 @@
 static void nfp_net_tlv_caps_reset(struct nfp_net_tlv_caps *caps)
 {
memset(caps, 0, sizeof(*caps));
+   caps->me_freq_mhz = 1200;
 }
 
 int nfp_net_tlv_caps_parse(struct device *dev, u8 __iomem *ctrl_mem,
@@ -91,6 +92,16 @@ int nfp_net_tlv_caps_parse(struct device *dev, u8 __iomem 
*ctrl_mem,
dev_err(dev, "END TLV should be empty, has len:%d\n",
length);
return -EINVAL;
+   case NFP_NET_CFG_TLV_TYPE_ME_FREQ:
+   if (length != 4) {
+   dev_err(dev,
+   "ME FREQ TLV should be 4B, is %dB\n",
+   length);
+   return -EINVAL;
+   }
+
+   caps->me_freq_mhz = readl(data);
+   break;
default:
if (!FIELD_GET(NFP_NET_CFG_TLV_HEADER_REQUIRED, hdr))
break;
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.h 
b/drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.h
index cb058050e079..4c288cdd0e18 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.h
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.h
@@ -474,17 +474,24 @@
  * %NFP_NET_CFG_TLV_TYPE_END:
  * Empty, end of TLV list.  Must be the last TLV.  Drivers will stop processing
  * further TLVs when encountered.
+ *
+ * %NFP_NET_CFG_TLV_TYPE_ME_FREQ:
+ * Single word, ME frequency in MHz as used in calculation for
+ * %NFP_NET_CFG_RXR_IRQ_MOD and %NFP_NET_CFG_TXR_IRQ_MOD.
  */
 #define NFP_NET_CFG_TLV_TYPE_UNKNOWN   0
 #define NFP_NET_CFG_TLV_TYPE_RESERVED  1
 #define NFP_NET_CFG_TLV_TYPE_END   2
+#define NFP_NET_CFG_TLV_TYPE_ME_FREQ   3
 
 struct device;
 
 /**
  * struct nfp_net_tlv_caps - parsed control BAR TLV capabilities
+ * @me_freq_mhz:   ME clock_freq (MHz)
  */
 struct nfp_net_tlv_caps {
+   u32 me_freq_mhz;
 };
 
 int nfp_net_tlv_caps_parse(struct device *dev, u8 __iomem *ctrl_mem,
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_main.c 
b/drivers/net/ethernet/netronome/nfp/nfp_net_main.c
index fd9554002fca..15fa47f622aa 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_main.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_main.c
@@ -208,12 +208,6 @@ nfp_net_pf_init_vnic(struct nfp_pf *pf, struct nfp_net 
*nn, unsigned int id)
 {
int err;
 
-   /* Get ME clock frequency from ctrl BAR
-* XXX for now frequency is hardcoded until we figure out how
-* to get the value from nfp-hwinfo into ctrl bar
-*/
-   nn->me_freq_mhz = 1200;
-
err = nfp_net_init(nn);
if (err)
return err;
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_netvf_main.c 
b/drivers/net/ethernet/netronome/nfp/nfp_netvf_main.c
index c879626e035b..b802a1d55449 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_netvf_main.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_netvf_main.c
@@ -277,12 +277,6 @@ static int nfp_netvf_pci_probe(struct pci_dev *pdev,
}
nfp_net_irqs_assign(nn, vf->irq_entries, num_irqs);
 
-   /* Get ME clock frequency from ctrl BAR
-* XXX for now frequency is hardcoded until we figure out how
-* to get the value from nfp-hwinfo into ctrl bar
-*/
-   nn->me_

[PATCH net-next 01/12] nfp: core: make scalar CPP helpers fail on short accesses

2018-01-17 Thread Jakub Kicinski
Currently the helpers for accessing 4 or 8 byte values over
the CPP bus return the length of IO on success.  If the IO
was short caller has to deal with error handling.  The short
IO for 4/8B values is completely impractical.  Make the
helpers return an error if full access was not possible.
Fix the few places which are actually dealing with errors
correctly, most call sites already only deal with negative
return codes.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Dirk van der Merwe 
---
 .../net/ethernet/netronome/nfp/nfp_net_debugdump.c | 19 +-
 .../ethernet/netronome/nfp/nfpcore/nfp_cppcore.c   | 42 +-
 .../ethernet/netronome/nfp/nfpcore/nfp_cpplib.c| 38 
 .../net/ethernet/netronome/nfp/nfpcore/nfp_rtsym.c |  4 ---
 4 files changed, 58 insertions(+), 45 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_debugdump.c 
b/drivers/net/ethernet/netronome/nfp/nfp_net_debugdump.c
index 173646e17e94..e6f19f44b461 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_debugdump.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_debugdump.c
@@ -518,16 +518,15 @@ nfp_dump_csr_range(struct nfp_pf *pf, struct 
nfp_dumpspec_csr *spec_csr,
max_rd_addr = cpp_rd_addr + be32_to_cpu(spec_csr->cpp.dump_length);
 
while (cpp_rd_addr < max_rd_addr) {
-   if (is_xpb_read(&spec_csr->cpp.cpp_id))
-   bytes_read = nfp_xpb_readl(pf->cpp, cpp_rd_addr,
-  (u32 *)dest);
-   else
+   if (is_xpb_read(&spec_csr->cpp.cpp_id)) {
+   err = nfp_xpb_readl(pf->cpp, cpp_rd_addr, (u32 *)dest);
+   } else {
bytes_read = nfp_cpp_read(pf->cpp, cpp_id, cpp_rd_addr,
  dest, reg_sz);
-   if (bytes_read != reg_sz) {
-   if (bytes_read >= 0)
-   bytes_read = -EIO;
-   dump_header->error = cpu_to_be32(bytes_read);
+   err = bytes_read == reg_sz ? 0 : -EIO;
+   }
+   if (err) {
+   dump_header->error = cpu_to_be32(err);
dump_header->error_offset = cpu_to_be32(cpp_rd_addr);
break;
}
@@ -555,8 +554,8 @@ nfp_read_indirect_csr(struct nfp_cpp *cpp,
   NFP_IND_ME_REFL_WR_SIG_INIT,
   cpp_params.token, cpp_params.island);
result = nfp_cpp_writel(cpp, cpp_id, csr_ctx_ptr_offs, context);
-   if (result != sizeof(context))
-   return result < 0 ? result : -EIO;
+   if (result)
+   return result;
 
cpp_id = nfp_get_numeric_cpp_id(&cpp_params);
result = nfp_cpp_read(cpp, cpp_id, csr_ctx_ptr_offs, dest, reg_sz);
diff --git a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_cppcore.c 
b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_cppcore.c
index 28262470dabf..ef30597aa319 100644
--- a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_cppcore.c
+++ b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_cppcore.c
@@ -674,18 +674,20 @@ void __iomem *nfp_cpp_area_iomem(struct nfp_cpp_area 
*area)
  * @offset:Offset into area
  * @value: Pointer to read buffer
  *
- * Return: length of the io, or -ERRNO
+ * Return: 0 on success, or -ERRNO
  */
 int nfp_cpp_area_readl(struct nfp_cpp_area *area,
   unsigned long offset, u32 *value)
 {
u8 tmp[4];
-   int err;
+   int n;
 
-   err = nfp_cpp_area_read(area, offset, &tmp, sizeof(tmp));
-   *value = get_unaligned_le32(tmp);
+   n = nfp_cpp_area_read(area, offset, &tmp, sizeof(tmp));
+   if (n != sizeof(tmp))
+   return n < 0 ? n : -EIO;
 
-   return err;
+   *value = get_unaligned_le32(tmp);
+   return 0;
 }
 
 /**
@@ -694,16 +696,18 @@ int nfp_cpp_area_readl(struct nfp_cpp_area *area,
  * @offset:Offset into area
  * @value: Value to write
  *
- * Return: length of the io, or -ERRNO
+ * Return: 0 on success, or -ERRNO
  */
 int nfp_cpp_area_writel(struct nfp_cpp_area *area,
unsigned long offset, u32 value)
 {
u8 tmp[4];
+   int n;
 
put_unaligned_le32(value, tmp);
+   n = nfp_cpp_area_write(area, offset, &tmp, sizeof(tmp));
 
-   return nfp_cpp_area_write(area, offset, &tmp, sizeof(tmp));
+   return n == sizeof(tmp) ? 0 : n < 0 ? n : -EIO;
 }
 
 /**
@@ -712,18 +716,20 @@ int nfp_cpp_area_writel(struct nfp_cpp_area *area,
  * @offset:Offset into area
  * @value: Pointer to read buffer
  *
- * Return: length of the io, or -ERRNO
+ * Return: 0 on success, or -ERRNO
  */
 int nfp_cpp_area_readq(struct nfp_cpp_area *area,
   unsigned long offset, u64 *value)
 {
u8 tmp[8];
-   int err;
+   int n;
 
-   err = nfp_cpp_area_read(area, offset

[PATCH net-next 03/12] nfp: register devlink after app is created

2018-01-17 Thread Jakub Kicinski
Devlink used to have two global locks: devlink lock and port lock,
our lock ordering looked like this:

  devlink lock -> driver's pf->lock -> devlink port lock

After recent changes port lock was replaced with per-instance
lock.  Unfortunately, new per-instance lock is taken on most
operations now.  This means we can only grab the pf->lock from
the port split/unsplit ops.  Lock ordering looks like this:

  devlink lock -> driver's pf->lock -> devlink instance lock

Since we can't take pf->lock from most devlink ops, make sure
nfp_apps are prepared to service them as soon as devlink is
registered.  Locking the pf must be pushed down after
nfp_app_init() callback.

The init order looks like this:
 nfp_app_init
 devlink_register
 nfp_app_start
 netdev/port_register

As soon as app_init is done nfp_apps must be ready to service
devlink-related callbacks.  apps can only register their own
devlink objects from nfp_app_start.

Fixes: 2406e7e546b2 ("devlink: Add per devlink instance lock")
Signed-off-by: Jakub Kicinski 
Reviewed-by: Dirk van der Merwe 
---
 drivers/net/ethernet/netronome/nfp/nfp_devlink.c  | 12 +---
 drivers/net/ethernet/netronome/nfp/nfp_main.c | 17 ++-
 drivers/net/ethernet/netronome/nfp/nfp_net_main.c | 36 ---
 3 files changed, 29 insertions(+), 36 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/nfp_devlink.c 
b/drivers/net/ethernet/netronome/nfp/nfp_devlink.c
index 6c9f29c2e975..eb0fc614673d 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_devlink.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_devlink.c
@@ -152,18 +152,8 @@ nfp_devlink_port_unsplit(struct devlink *devlink, unsigned 
int port_index)
 static int nfp_devlink_eswitch_mode_get(struct devlink *devlink, u16 *mode)
 {
struct nfp_pf *pf = devlink_priv(devlink);
-   int ret;
-
-   mutex_lock(&pf->lock);
-   if (!pf->app) {
-   ret = -EBUSY;
-   goto out;
-   }
-   ret = nfp_app_eswitch_mode_get(pf->app, mode);
-out:
-   mutex_unlock(&pf->lock);
 
-   return ret;
+   return nfp_app_eswitch_mode_get(pf->app, mode);
 }
 
 const struct devlink_ops nfp_devlink_ops = {
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_main.c 
b/drivers/net/ethernet/netronome/nfp/nfp_main.c
index 0953fa8f3109..c5b91040b12e 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_main.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_main.c
@@ -499,13 +499,9 @@ static int nfp_pci_probe(struct pci_dev *pdev,
if (err)
goto err_hwinfo_free;
 
-   err = devlink_register(devlink, &pdev->dev);
-   if (err)
-   goto err_hwinfo_free;
-
err = nfp_nsp_init(pdev, pf);
if (err)
-   goto err_devlink_unreg;
+   goto err_hwinfo_free;
 
pf->mip = nfp_mip_open(pf->cpp);
pf->rtbl = __nfp_rtsym_table_read(pf->cpp, pf->mip);
@@ -549,8 +545,6 @@ static int nfp_pci_probe(struct pci_dev *pdev,
kfree(pf->eth_tbl);
kfree(pf->nspi);
vfree(pf->dumpspec);
-err_devlink_unreg:
-   devlink_unregister(devlink);
 err_hwinfo_free:
kfree(pf->hwinfo);
nfp_cpp_free(pf->cpp);
@@ -571,18 +565,13 @@ static int nfp_pci_probe(struct pci_dev *pdev,
 static void nfp_pci_remove(struct pci_dev *pdev)
 {
struct nfp_pf *pf = pci_get_drvdata(pdev);
-   struct devlink *devlink;
 
nfp_hwmon_unregister(pf);
 
-   devlink = priv_to_devlink(pf);
-
-   nfp_net_pci_remove(pf);
-
nfp_pcie_sriov_disable(pdev);
pci_sriov_set_totalvfs(pf->pdev, 0);
 
-   devlink_unregister(devlink);
+   nfp_net_pci_remove(pf);
 
vfree(pf->dumpspec);
kfree(pf->rtbl);
@@ -598,7 +587,7 @@ static void nfp_pci_remove(struct pci_dev *pdev)
kfree(pf->eth_tbl);
kfree(pf->nspi);
mutex_destroy(&pf->lock);
-   devlink_free(devlink);
+   devlink_free(priv_to_devlink(pf));
pci_release_regions(pdev);
pci_disable_device(pdev);
 }
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_main.c 
b/drivers/net/ethernet/netronome/nfp/nfp_net_main.c
index e4f4aa5c298e..fd9554002fca 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_main.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_main.c
@@ -373,7 +373,9 @@ nfp_net_pf_app_init(struct nfp_pf *pf, u8 __iomem *qc_bar, 
unsigned int stride)
if (IS_ERR(pf->app))
return PTR_ERR(pf->app);
 
+   mutex_lock(&pf->lock);
err = nfp_app_init(pf->app);
+   mutex_unlock(&pf->lock);
if (err)
goto err_free;
 
@@ -401,7 +403,9 @@ nfp_net_pf_app_init(struct nfp_pf *pf, u8 __iomem *qc_bar, 
unsigned int stride)
 err_unmap:
nfp_cpp_area_release_free(pf->ctrl_vnic_bar);
 err_app_clean:
+   mutex_lock(&pf->lock);
nfp_app_clean(pf->app);
+   mutex_unlock(&pf->lock);
 err_free:
nfp_app_free(pf->app);
pf->app = NULL;
@@ -414,7 +418,11 @@ static void nfp_n

[PATCH net-next 12/12] nfp: bpf: disable all ctrl vNIC capabilities

2018-01-17 Thread Jakub Kicinski
BPF firmware currently exposes IRQ moderation capability.
The driver will make use of it by default, inserting 50 usec
delay to every control message exchange.  This cuts the number
of messages per second we can exchange by almost half.

None of the other capabilities make much sense for BPF control
vNIC, either.  Disable them all.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Dirk van der Merwe 
---
 drivers/net/ethernet/netronome/nfp/bpf/main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.c 
b/drivers/net/ethernet/netronome/nfp/bpf/main.c
index 5f021d0c88a4..4ee11bf2aed7 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/main.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/main.c
@@ -389,7 +389,7 @@ const struct nfp_app_type app_bpf = {
.id = NFP_APP_BPF_NIC,
.name   = "ebpf",
 
-   .ctrl_cap_mask  = ~0U,
+   .ctrl_cap_mask  = 0,
 
.init   = nfp_bpf_init,
.clean  = nfp_bpf_clean,
-- 
2.15.1



[PATCH net-next 00/12] nfp: devlink, capabilities extensions and updates

2018-01-17 Thread Jakub Kicinski
Hi!

This series starts with an improvement to the usability of the device
memory accessors (CPP transactions).  Next few patches are devoted to
fixing the devlink locking.  After recent patches for mlxsw the locking
scheme of devlink ops has to be reworked.  Following patches improve
NFP code dealing with "representors", and expands the error message
printed when driver has no support for loaded FW.

Second part of the series is focused on vNIC capabilities read from
vNIC control memory (often referred to as "BAR0" for historical reasons).
TLV capability format is established and immediately made use of.  The
next patches rework parsing of features for control vNIC which allows
apps to mask out features they don't want enabled.


Jakub Kicinski (12):
  nfp: core: make scalar CPP helpers fail on short accesses
  nfp: release global resources only on the remove path
  nfp: register devlink after app is created
  nfp: add nfp_reprs_get_locked() helper
  nfp: protect each repr pointer individually with RCU
  nfp: improve app not found message
  nfp: add TLV capabilities to the BAR
  nfp: read ME frequency from vNIC ctrl memory
  nfp: read mailbox address from TLV caps
  nfp: split reading capabilities out of nfp_net_init()
  nfp: allow apps to disable ctrl vNIC capabilities
  nfp: bpf: disable all ctrl vNIC capabilities

 drivers/net/ethernet/netronome/nfp/Makefile|   1 +
 drivers/net/ethernet/netronome/nfp/bpf/main.c  |   2 +
 drivers/net/ethernet/netronome/nfp/flower/main.c   |  47 ---
 drivers/net/ethernet/netronome/nfp/nfp_app.c   |  14 ++-
 drivers/net/ethernet/netronome/nfp/nfp_app.h   |   6 +
 drivers/net/ethernet/netronome/nfp/nfp_devlink.c   |  12 +-
 drivers/net/ethernet/netronome/nfp/nfp_main.c  |  17 +--
 drivers/net/ethernet/netronome/nfp/nfp_net.h   |   3 +
 .../net/ethernet/netronome/nfp/nfp_net_common.c|  62 +++---
 drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.c  | 135 +
 drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.h  |  99 +--
 .../net/ethernet/netronome/nfp/nfp_net_debugdump.c |  19 ++-
 drivers/net/ethernet/netronome/nfp/nfp_net_main.c  |  76 ++--
 drivers/net/ethernet/netronome/nfp/nfp_net_repr.c  |  73 +--
 drivers/net/ethernet/netronome/nfp/nfp_net_repr.h  |  15 ++-
 .../net/ethernet/netronome/nfp/nfp_netvf_main.c|   6 -
 .../ethernet/netronome/nfp/nfpcore/nfp_cppcore.c   |  42 ---
 .../ethernet/netronome/nfp/nfpcore/nfp_cpplib.c|  38 +++---
 .../net/ethernet/netronome/nfp/nfpcore/nfp_rtsym.c |   4 -
 19 files changed, 463 insertions(+), 208 deletions(-)
 create mode 100644 drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.c

-- 
2.15.1



Re: dangers of bots on the mailing lists was Re: divide error in ___bpf_prog_run

2018-01-17 Thread Joe Perches
On Wed, 2018-01-17 at 17:46 -0800, Eric Biggers wrote:
> On Wed, Jan 17, 2018 at 05:18:17PM -0800, Joe Perches wrote:
> > On Wed, 2018-01-17 at 20:09 -0500, Theodore Ts'o wrote:
> > > get_maintainer.pl, which is often not accurate
> > 
> > Examples please.
> > 
> 
> Well, the primary problem is that place the crash occurs is not necessarily
> responsible for the bug.  But, syzbot actually does have a file blacklist for
> exactly that reason; see
> https://github.com/google/syzkaller/blob/master/pkg/report/linux.go#L56

Which has no association to a problem with get_maintainer.

> And yes, get_maintainer.pl sometimes isn't accurate even if the offending code
> is correctly identified.  That's more of a community problem, e.g. people
> sometimes don't bother to remove themselves from MAINTAINERS when they quit
> maintaining, and sometimes people don't feel responsible enough for a file to
> add themselves to MAINTAINERS, even when in practice they are actually taking
> most of the patches to it through their tree.

Yup, not a get_maintainer problem.

There are more than 1800 sections and more than
1200 individual names in the MAINTAINERS file.

In practice, there are a few dozen maintainers
that are upstream patch paths.



RE: [PATCH] net: fec: add necessary defines to work on ARM64

2018-01-17 Thread Andy Duan
From: Lucas Stach  Sent: Thursday, January 18, 2018 
2:31 AM
>The i.MX8 is a ARMv8 based SoC, that uses the same FEC IP as the earlier,
>ARMv7 based, i.MX SoCs. Allow the driver to work on ARM64.
>
>Signed-off-by: Lucas Stach 
>---
> drivers/net/ethernet/freescale/fec.h  | 5 +++--
> drivers/net/ethernet/freescale/fec_main.c | 8 +---
> 2 files changed, 8 insertions(+), 5 deletions(-)
>
>diff --git a/drivers/net/ethernet/freescale/fec.h
>b/drivers/net/ethernet/freescale/fec.h
>index 5385074b3b7d..e7381f8ef89d 100644
>--- a/drivers/net/ethernet/freescale/fec.h
>+++ b/drivers/net/ethernet/freescale/fec.h
>@@ -20,7 +20,8 @@
> #include 
>
> #if defined(CONFIG_M523x) || defined(CONFIG_M527x) ||
>defined(CONFIG_M528x) || \
>-defined(CONFIG_M520x) || defined(CONFIG_M532x) ||
>defined(CONFIG_ARM)
>+defined(CONFIG_M520x) || defined(CONFIG_M532x) ||
>defined(CONFIG_ARM) || \
>+defined(CONFIG_ARM64)
> /*
>  *Just figures, Motorola would have to change the offsets for
>  *registers in the same peripheral device on different models
>@@ -195,7 +196,7 @@
>  *Evidently, ARM SoCs have the FEC block generated in a
>  *little endian mode so adjust endianness accordingly.
>  */
>-#if defined(CONFIG_ARM)
>+#if defined(CONFIG_ARM) || defined(CONFIG_ARM64)
> #define fec32_to_cpu le32_to_cpu
> #define fec16_to_cpu le16_to_cpu
> #define cpu_to_fec32 cpu_to_le32
>diff --git a/drivers/net/ethernet/freescale/fec_main.c
>b/drivers/net/ethernet/freescale/fec_main.c
>index a74300a4459c..45eed37373a9 100644
>--- a/drivers/net/ethernet/freescale/fec_main.c
>+++ b/drivers/net/ethernet/freescale/fec_main.c
>@@ -195,7 +195,8 @@ MODULE_PARM_DESC(macaddr, "FEC Ethernet MAC
>address");
>  * account when setting it.
>  */
> #if defined(CONFIG_M523x) || defined(CONFIG_M527x) ||
>defined(CONFIG_M528x) || \
>-defined(CONFIG_M520x) || defined(CONFIG_M532x) ||
>defined(CONFIG_ARM)
>+defined(CONFIG_M520x) || defined(CONFIG_M532x) ||
>defined(CONFIG_ARM) || \
>+defined(CONFIG_ARM64)
> #define   OPT_FRAME_SIZE  (PKT_MAXBUF_SIZE << 16)
> #else
> #define   OPT_FRAME_SIZE  0
>@@ -2107,7 +2108,8 @@ static int fec_enet_get_regs_len(struct net_device
>*ndev)
>
> /* List of registers that can be safety be read to dump them with ethtool */
>#if defined(CONFIG_M523x) || defined(CONFIG_M527x) ||
>defined(CONFIG_M528x) || \
>-  defined(CONFIG_M520x) || defined(CONFIG_M532x) ||
>defined(CONFIG_ARM)
>+  defined(CONFIG_M520x) || defined(CONFIG_M532x) ||
>defined(CONFIG_ARM) || \
>+  defined(CONFIG_ARM64)
> static u32 fec_enet_register_offset[] = {
>   FEC_IEVENT, FEC_IMASK, FEC_R_DES_ACTIVE_0,
>FEC_X_DES_ACTIVE_0,
>   FEC_ECNTRL, FEC_MII_DATA, FEC_MII_SPEED, FEC_MIB_CTRLSTAT,
>FEC_R_CNTRL, @@ -3119,7 +3121,7 @@ static int fec_enet_init(struct
>net_device *ndev)
>   unsigned dsize_log2 = __fls(dsize);
>
>   WARN_ON(dsize != (1 << dsize_log2));
>-#if defined(CONFIG_ARM)
>+#if defined(CONFIG_ARM) || defined(CONFIG_ARM64)
>   fep->rx_align = 0xf;
>   fep->tx_align = 0xf;
> #else
>--
>2.11.0

Kconfig also need to add ARM64 support.


Re: dangers of bots on the mailing lists was Re: divide error in ___bpf_prog_run

2018-01-17 Thread Eric Biggers
On Wed, Jan 17, 2018 at 05:18:17PM -0800, Joe Perches wrote:
> On Wed, 2018-01-17 at 20:09 -0500, Theodore Ts'o wrote:
> > get_maintainer.pl, which is often not accurate
> 
> Examples please.
> 

Well, the primary problem is that place the crash occurs is not necessarily
responsible for the bug.  But, syzbot actually does have a file blacklist for
exactly that reason; see
https://github.com/google/syzkaller/blob/master/pkg/report/linux.go#L56

It definitely needs further improvement (and anyone is welcome to contribute),
though it will never be perfect.  

There is also a KASAN change by Dmitry queued up for 4.16 that will allow KASAN
to detect invalid frees.  That would have detected the bug in crypto/pcrypt.c
that was causing corruption in the kmalloc-1024 slab cache, and was causing
crashes in all sorts of random kernel code, resulting many bug reports.  So,
detecting bugs early before they corrupt all sorts of random kernel data
structures helps a lot too.

And yes, get_maintainer.pl sometimes isn't accurate even if the offending code
is correctly identified.  That's more of a community problem, e.g. people
sometimes don't bother to remove themselves from MAINTAINERS when they quit
maintaining, and sometimes people don't feel responsible enough for a file to
add themselves to MAINTAINERS, even when in practice they are actually taking
most of the patches to it through their tree.

Eric


[PATCH net-next] net: hns: Fix for variable may be used uninitialized warnings

2018-01-17 Thread Huazhong Tan
When !CONFIG_REGMAP hns throws compiler warnings since
dsaf_read_syscon ignores the return result from regmap_read,
which allows val to be uninitialized.

Fixes: 86897c960b49 ("net: hns: add syscon operation for dsaf")
Reported-by: Jason Gunthorpe 
Signed-off-by: Huazhong Tan 
Signed-off-by: Yunsheng Lin 
---
 drivers/net/ethernet/hisilicon/hns/hns_dsaf_misc.c | 52 +++---
 drivers/net/ethernet/hisilicon/hns/hns_dsaf_reg.h  |  7 +--
 2 files changed, 38 insertions(+), 21 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_misc.c 
b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_misc.c
index ca247c2..acf2963 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_misc.c
+++ b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_misc.c
@@ -44,12 +44,17 @@ static void dsaf_write_sub(struct dsaf_device *dsaf_dev, 
u32 reg, u32 val)
 
 static u32 dsaf_read_sub(struct dsaf_device *dsaf_dev, u32 reg)
 {
-   u32 ret;
-
-   if (dsaf_dev->sub_ctrl)
-   ret = dsaf_read_syscon(dsaf_dev->sub_ctrl, reg);
-   else
+   u32 ret = 0;
+   int err;
+
+   if (dsaf_dev->sub_ctrl) {
+   err = dsaf_read_syscon(dsaf_dev->sub_ctrl, reg, &ret);
+   if (err)
+   dev_err(dsaf_dev->dev, "dsaf_read_syscon error %d!\n",
+   err);
+   } else {
ret = dsaf_read_reg(dsaf_dev->sc_base, reg);
+   }
 
return ret;
 }
@@ -188,18 +193,23 @@ static void cpld_led_reset_acpi(struct hns_mac_cb *mac_cb)
 static int cpld_set_led_id(struct hns_mac_cb *mac_cb,
   enum hnae_led_state status)
 {
+   u32 val = 0;
+   int ret;
+
if (!mac_cb->cpld_ctrl)
return 0;
 
switch (status) {
case HNAE_LED_ACTIVE:
-   mac_cb->cpld_led_value =
-   dsaf_read_syscon(mac_cb->cpld_ctrl,
-mac_cb->cpld_ctrl_reg);
-   dsaf_set_bit(mac_cb->cpld_led_value, DSAF_LED_ANCHOR_B,
-CPLD_LED_ON_VALUE);
+   ret = dsaf_read_syscon(mac_cb->cpld_ctrl, mac_cb->cpld_ctrl_reg,
+  &val);
+   if (ret)
+   return ret;
+
+   dsaf_set_bit(val, DSAF_LED_ANCHOR_B, CPLD_LED_ON_VALUE);
dsaf_write_syscon(mac_cb->cpld_ctrl, mac_cb->cpld_ctrl_reg,
- mac_cb->cpld_led_value);
+ val);
+   mac_cb->cpld_led_value = val;
break;
case HNAE_LED_INACTIVE:
dsaf_set_bit(mac_cb->cpld_led_value, DSAF_LED_ANCHOR_B,
@@ -560,12 +570,19 @@ static phy_interface_t hns_mac_get_phy_if_acpi(struct 
hns_mac_cb *mac_cb)
 
 int hns_mac_get_sfp_prsnt(struct hns_mac_cb *mac_cb, int *sfp_prsnt)
 {
+   u32 val = 0;
+   int ret;
+
if (!mac_cb->cpld_ctrl)
return -ENODEV;
 
-   *sfp_prsnt = !dsaf_read_syscon(mac_cb->cpld_ctrl, mac_cb->cpld_ctrl_reg
-   + MAC_SFP_PORT_OFFSET);
+   ret = dsaf_read_syscon(mac_cb->cpld_ctrl,
+  mac_cb->cpld_ctrl_reg + MAC_SFP_PORT_OFFSET,
+  &val);
+   if (ret)
+   return ret;
 
+   *sfp_prsnt = !val;
return 0;
 }
 
@@ -615,7 +632,7 @@ static int hns_mac_config_sds_loopback(struct hns_mac_cb 
*mac_cb, bool en)
 #define RX_CSR(lane, reg) ((0x4080 + (reg) * 0x0002 + (lane) * 0x0200) * 2)
u64 reg_offset = RX_CSR(lane_id[mac_cb->mac_id], 0);
 
-   int sfp_prsnt;
+   int sfp_prsnt = 0;
int ret = hns_mac_get_sfp_prsnt(mac_cb, &sfp_prsnt);
 
if (!mac_cb->phy_dev) {
@@ -627,7 +644,7 @@ static int hns_mac_config_sds_loopback(struct hns_mac_cb 
*mac_cb, bool en)
}
 
if (mac_cb->serdes_ctrl) {
-   u32 origin;
+   u32 origin = 0;
 
if (!AE_IS_VER1(mac_cb->dsaf_dev->dsaf_ver)) {
 #define HILINK_ACCESS_SEL_CFG  0x40008
@@ -644,7 +661,10 @@ static int hns_mac_config_sds_loopback(struct hns_mac_cb 
*mac_cb, bool en)
  HILINK_ACCESS_SEL_CFG, 3);
}
 
-   origin = dsaf_read_syscon(mac_cb->serdes_ctrl, reg_offset);
+   ret = dsaf_read_syscon(mac_cb->serdes_ctrl, reg_offset,
+  &origin);
+   if (ret)
+   return ret;
 
dsaf_set_field(origin, 1ull << 10, 10, en);
dsaf_write_syscon(mac_cb->serdes_ctrl, reg_offset, origin);
diff --git a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_reg.h 
b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_reg.h
index 46a52d9..886cbbf 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_reg.h
+++ b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_reg.h
@@ -1034,12 +1034,9 @@ static inline void dsaf_write_syscon(s

Re: [PATCH] net/mlx5e: Fix trailing semicolon

2018-01-17 Thread Saeed Mahameed


On 01/17/2018 04:09 AM, Luis de Bethencourt wrote:
> The trailing semicolon is an empty statement that does no operation.
> Removing it since it doesn't do anything.
> 
> Signed-off-by: Luis de Bethencourt 

Looks Good, Thanks Luis.

Reviewed-by: Saeed Mahameed 



Re: [PATCHv2 net-next 8/8] net: sched: cls_u32: add extack support

2018-01-17 Thread Jakub Kicinski
On Wed, 17 Jan 2018 17:40:27 -0500, Alexander Aring wrote:
> diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
> index 8840baa1b9b4..daeac7282387 100644
> --- a/net/sched/cls_u32.c
> +++ b/net/sched/cls_u32.c
> @@ -524,8 +524,10 @@ static int u32_replace_hw_hnode(struct tcf_proto *tp, 
> struct tc_u_hnode *h,
>   offloaded = true;
>   }
>  
> - if (skip_sw && !offloaded)
> + if (skip_sw && !offloaded) {
> + NL_SET_ERR_MSG(extack, "Failed to offload filter requested with 
> skip sw");
>   return -EINVAL;
> + }
>  
>   return 0;
>  }

Why did you ignore my comment about using NL_SET_ERR_MSG_MOD?  Do you
disagree with it?


Re: [PATCHv2 net-next 7/8] net: sched: cls: add extack support for tc_setup_cb_call

2018-01-17 Thread Jakub Kicinski
On Wed, 17 Jan 2018 17:40:26 -0500, Alexander Aring wrote:
> diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
> index 9f88107c29c5..e864ad523800 100644
> --- a/net/sched/cls_api.c
> +++ b/net/sched/cls_api.c
> @@ -1566,21 +1566,26 @@ static int tc_exts_setup_cb_egdev_call(struct 
> tcf_exts *exts,
>  }
>  
>  int tc_setup_cb_call(struct tcf_block *block, struct tcf_exts *exts,
> -  enum tc_setup_type type, void *type_data, bool err_stop)
> +  enum tc_setup_type type, void *type_data, bool err_stop,
> +  struct netlink_ext_ack *extack)
>  {
>   int ok_count;
>   int ret;
>  
>   ret = tcf_block_cb_call(block, type, type_data, err_stop);
> - if (ret < 0)
> + if (ret < 0) {
> + NL_SET_ERR_MSG(extack, "Failed to initialize tcf block");

This has nothing to do with block init.

>   return ret;
> + }
>   ok_count = ret;
>  
>   if (!exts)
>   return ok_count;
>   ret = tc_exts_setup_cb_egdev_call(exts, type, type_data, err_stop);
> - if (ret < 0)
> + if (ret < 0) {
> + NL_SET_ERR_MSG(extack, "Failed to initialize tcf block 
> extensions");

Ditto, plus this is about redirections to other devices (hence
eg[ress ]dev).  exts part is an internal detail.

>   return ret;
> + }
>   ok_count += ret;
>  
>   return ok_count;

These messages are completely off-target, and this is not a right place
to handle any of this.

Quentin's patches do the correct thing for offload, please drop this
patch from the series and let us handle the offload cases.

Thank you.


linux-next: build warning after merge of the net-next tree

2018-01-17 Thread Stephen Rothwell
Hi all,

After merging the net-next tree, today's linux-next build (powerpc
ppc64_defconfig) produced this warning:

net/sched/cls_api.c: In function 'tc_dump_tfilter':
net/sched/cls_api.c:1362:8: warning: 'parent' may be used uninitialized in this 
function [-Wmaybe-uninitialized]
   if (!tcf_chain_dump(chain, q, parent, skb, cb,
^
 index_start, &index))
 

Probably introduced by commit

  7960d1daf278 ("net: sched: use block index as a handle instead of qdisc when 
block is shared")

-- 
Cheers,
Stephen Rothwell


Re: dangers of bots on the mailing lists was Re: divide error in ___bpf_prog_run

2018-01-17 Thread Joe Perches
On Wed, 2018-01-17 at 20:09 -0500, Theodore Ts'o wrote:
> get_maintainer.pl, which is often not accurate

Examples please.



pull-request: bpf 2018-01-18

2018-01-17 Thread Daniel Borkmann
Hi David,

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Fix a divide by zero due to wrong if (src_reg == 0) check in
   64-bit mode. Properly handle this in interpreter and mask it
   also generically in verifier to guard against similar checks
   in JITs, from Eric and Alexei.

2) Fix a bug in arm64 JIT when tail calls are involved and progs
   have different stack sizes, from Daniel.

3) Reject stores into BPF context that are not expected BPF_STX |
   BPF_MEM variant, from Daniel.

4) Mark dst reg as unknown on {s,u}bounds adjustments when the
   src reg has derived bounds from dead branches, from Daniel.

Please consider pulling these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

Thanks a lot!



The following changes since commit 8155aedf512edd3f88ef19f7cacf476ace7d1322:

  Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf (2018-01-14 
11:01:33 -0500)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git 

for you to fetch changes up to 6f16101e6a8b4324c36e58a29d9e0dbb287cdedb:

  bpf: mark dst unknown on inconsistent {s, u}bounds adjustments (2018-01-17 
16:23:17 -0800)


Alexei Starovoitov (1):
  bpf: fix 32-bit divide by zero

Daniel Borkmann (3):
  bpf, arm64: fix stack_depth tracking in combination with tail calls
  bpf: reject stores into ctx via st and xadd
  bpf: mark dst unknown on inconsistent {s, u}bounds adjustments

Eric Dumazet (1):
  bpf: fix divides by zero

 arch/arm64/net/bpf_jit_comp.c   |  20 ++--
 kernel/bpf/core.c   |   4 +-
 kernel/bpf/verifier.c   |  64 ++--
 net/core/filter.c   |   4 +
 tools/testing/selftests/bpf/test_verifier.c | 152 +++-
 5 files changed, 219 insertions(+), 25 deletions(-)


[PATCH net-next] macvlan: Pass SIOC[SG]HWTSTAMP ioctls and get_ts_info to lower device

2018-01-17 Thread grzegorz . halat
This patch allows to enable hardware timestamping on macvlan intefaces and find 
out capabilities of the lower device.

Signed-off-by: Grzegorz Halat 
---
 drivers/net/macvlan.c | 43 +++
 1 file changed, 43 insertions(+)

diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index a0f2be8..314e878 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -23,6 +23,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -942,6 +943,30 @@ static void macvlan_dev_get_stats64(struct net_device *dev,
}
 }

+static int macvlan_do_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd)
+{
+   struct macvlan_dev *vlan = netdev_priv(dev);
+   const struct net_device_ops *ops = vlan->lowerdev->netdev_ops;
+   struct ifreq ifrl;
+   int err = -EOPNOTSUPP;
+
+   strncpy(ifrl.ifr_name, vlan->lowerdev->name, IFNAMSIZ);
+   ifrl.ifr_ifru = ifr->ifr_ifru;
+
+   switch (cmd) {
+   case SIOCGHWTSTAMP:
+   case SIOCSHWTSTAMP:
+   if (ops->ndo_do_ioctl)
+   err = ops->ndo_do_ioctl(vlan->lowerdev, &ifrl, cmd);
+   break;
+   }
+
+   if (!err)
+   ifr->ifr_ifru = ifrl.ifr_ifru;
+
+   return err;
+}
+
 static int macvlan_vlan_rx_add_vid(struct net_device *dev,
   __be16 proto, u16 vid)
 {
@@ -1022,6 +1047,22 @@ static int macvlan_ethtool_get_link_ksettings(struct 
net_device *dev,
return __ethtool_get_link_ksettings(vlan->lowerdev, cmd);
 }

+static int macvlan_ethtool_get_ts_info(struct net_device *dev,
+   struct ethtool_ts_info *ts_info)
+{
+   const struct macvlan_dev *vlan = netdev_priv(dev);
+   const struct ethtool_ops *eth_ops = vlan->lowerdev->ethtool_ops;
+
+   if (eth_ops->get_ts_info)
+   return eth_ops->get_ts_info(vlan->lowerdev, ts_info);
+
+   ts_info->so_timestamping = SOF_TIMESTAMPING_RX_SOFTWARE |
+  SOF_TIMESTAMPING_SOFTWARE;
+   ts_info->phc_index = -1;
+
+   return 0;
+}
+
 static netdev_features_t macvlan_fix_features(struct net_device *dev,
  netdev_features_t features)
 {
@@ -1096,6 +1137,7 @@ static const struct ethtool_ops macvlan_ethtool_ops = {
.get_link   = ethtool_op_get_link,
.get_link_ksettings = macvlan_ethtool_get_link_ksettings,
.get_drvinfo= macvlan_ethtool_get_drvinfo,
+   .get_ts_info= macvlan_ethtool_get_ts_info,
 };

 static const struct net_device_ops macvlan_netdev_ops = {
@@ -,6 +1153,7 @@ static const struct net_device_ops macvlan_netdev_ops = {
.ndo_set_rx_mode= macvlan_set_mac_lists,
.ndo_get_stats64= macvlan_dev_get_stats64,
.ndo_validate_addr  = eth_validate_addr,
+   .ndo_do_ioctl   = macvlan_do_ioctl,
.ndo_vlan_rx_add_vid= macvlan_vlan_rx_add_vid,
.ndo_vlan_rx_kill_vid   = macvlan_vlan_rx_kill_vid,
.ndo_fdb_add= macvlan_fdb_add,
--
2.7.4


Re: dangers of bots on the mailing lists was Re: divide error in ___bpf_prog_run

2018-01-17 Thread Theodore Ts'o
On Wed, Jan 17, 2018 at 04:21:13PM -0800, Alexei Starovoitov wrote:
> 
> If syzkaller can only test one tree than linux-next should be the one.

Well, there's been some controversy about that.  The problem is that
it's often not clear if this is long-standing bug, or a bug which is
in a particular subsystem tree --- and if so, *which* subsystem tree,
etc.  So it gets blasted to linux-kernel, and to get_maintainer.pl,
which is often not accurate --- since the location of the crash
doesn't necessarily point out where the problem originated, and hence
who should look at the syzbot report.  And so this has caused
some irritation.

> There is some value of testing stable trees, but any developer
> will first ask for a reproducer in the latest, so usefulness of
> reporting such bugs will be limited.

What I suggested was to test Linus's tree, and then when a problem is
found, and syzkaller has a reliable repro, to *then* try to see if it
*also* shows up in the LTS kernels.

- Ted


linux-next: manual merge of the net-next tree with the net tree

2018-01-17 Thread Stephen Rothwell
Hi all,

Today's linux-next merge of the net-next tree got a conflict in:

  drivers/net/tun.c

between commit:

  4df0bfc79904 ("tun: fix a memory leak for tfile->tx_array")

from the net tree and commits:

  8bf5c4ee1889 ("tun: setup xdp_rxq_info")
  5990a30510ed ("tun/tap: use ptr_ring instead of skb_array")
  fc72d1d54dd9 ("tuntap: XDP transmission")

from the net-next tree.

I have no idea how to clean this up, so I effectively reverted the net
tree commit.

I fixed it up (see above) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.

-- 
Cheers,
Stephen Rothwell


[PATCH bpf-next 0/2] bpf: increase bpf test coverage

2018-01-17 Thread Alexei Starovoitov
BPF verifier has 700+ tests used to check correctness of the verifier.
Beyond checking the verifier log tell kernel to run accepted programs as well
via bpf_prog_test_run() command. That improves quality of the tests
and increases bpf test coverage.

Alexei Starovoitov (2):
  bpf: allow socket_filter programs to use bpf_prog_test_run
  selftests/bpf: make test_verifier run most programs

 kernel/bpf/syscall.c|  2 ++
 net/core/filter.c   |  1 +
 tools/testing/selftests/bpf/test_verifier.c | 50 -
 3 files changed, 52 insertions(+), 1 deletion(-)

-- 
2.9.5



[PATCH bpf-next 1/2] bpf: allow socket_filter programs to use bpf_prog_test_run

2018-01-17 Thread Alexei Starovoitov
in order to improve test coverage allow socket_filter program type
to be run via bpf_prog_test_run command.
Since such programs can be loaded by non-root tighten
permissions for bpf_prog_test_run to be root only
to avoid surprises.

Signed-off-by: Alexei Starovoitov 
---
 kernel/bpf/syscall.c | 2 ++
 net/core/filter.c| 1 +
 2 files changed, 3 insertions(+)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index c691b9e972e3..08ca42b1b916 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1504,6 +1504,8 @@ static int bpf_prog_test_run(const union bpf_attr *attr,
struct bpf_prog *prog;
int ret = -ENOTSUPP;
 
+   if (!capable(CAP_SYS_ADMIN))
+   return -EPERM;
if (CHECK_ATTR(BPF_PROG_TEST_RUN))
return -EINVAL;
 
diff --git a/net/core/filter.c b/net/core/filter.c
index db2ee8c7e1bd..30fafaaa90fa 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4526,6 +4526,7 @@ const struct bpf_verifier_ops sk_filter_verifier_ops = {
 };
 
 const struct bpf_prog_ops sk_filter_prog_ops = {
+   .test_run   = bpf_prog_test_run_skb,
 };
 
 const struct bpf_verifier_ops tc_cls_act_verifier_ops = {
-- 
2.9.5



[PATCH bpf-next 2/2] selftests/bpf: make test_verifier run most programs

2018-01-17 Thread Alexei Starovoitov
to improve test coverage make test_verifier run all successfully loaded
programs on 64-byte zero initialized data.
For clsbpf and xdp it means empty 64-byte packet.
For lwt and socket_filters it's 64-byte packet where skb->data
points after L2.

Signed-off-by: Alexei Starovoitov 
---
 tools/testing/selftests/bpf/test_verifier.c | 50 -
 1 file changed, 49 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/bpf/test_verifier.c 
b/tools/testing/selftests/bpf/test_verifier.c
index 543847957fdd..87da8d1bf2dd 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -49,6 +50,8 @@
 #define MAX_INSNS  512
 #define MAX_FIXUPS 8
 #define MAX_NR_MAPS4
+#define POINTER_VALUE  0xcafe4all
+#define TEST_DATA_LEN  64
 
 #define F_NEEDS_EFFICIENT_UNALIGNED_ACCESS (1 << 0)
 #define F_LOAD_WITH_STRICT_ALIGNMENT   (1 << 1)
@@ -62,6 +65,7 @@ struct bpf_test {
int fixup_map_in_map[MAX_FIXUPS];
const char *errstr;
const char *errstr_unpriv;
+   uint32_t retval;
enum {
UNDEF,
ACCEPT,
@@ -95,6 +99,7 @@ static struct bpf_test tests[] = {
BPF_EXIT_INSN(),
},
.result = ACCEPT,
+   .retval = -3,
},
{
"unreachable",
@@ -210,6 +215,7 @@ static struct bpf_test tests[] = {
BPF_EXIT_INSN(),
},
.result = ACCEPT,
+   .retval = 1,
},
{
"test8 ld_imm64",
@@ -477,6 +483,7 @@ static struct bpf_test tests[] = {
.errstr_unpriv = "R0 leaks addr",
.result = ACCEPT,
.result_unpriv = REJECT,
+   .retval = POINTER_VALUE,
},
{
"check valid spill/fill, skb mark",
@@ -763,6 +770,7 @@ static struct bpf_test tests[] = {
.errstr_unpriv = "R1 pointer comparison",
.result_unpriv = REJECT,
.result = ACCEPT,
+   .retval = -ENOENT,
},
{
"jump test 4",
@@ -1783,6 +1791,7 @@ static struct bpf_test tests[] = {
BPF_EXIT_INSN(),
},
.result = ACCEPT,
+   .retval = 0xfaceb00c,
},
{
"PTR_TO_STACK store/load - bad alignment on off",
@@ -1841,6 +1850,7 @@ static struct bpf_test tests[] = {
.result = ACCEPT,
.result_unpriv = REJECT,
.errstr_unpriv = "R0 leaks addr",
+   .retval = POINTER_VALUE,
},
{
"unpriv: add const to pointer",
@@ -2014,6 +2024,7 @@ static struct bpf_test tests[] = {
BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_6, 0),
BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
 BPF_FUNC_get_hash_recalc),
+   BPF_MOV64_IMM(BPF_REG_0, 0),
BPF_EXIT_INSN(),
},
.result = ACCEPT,
@@ -2778,6 +2789,7 @@ static struct bpf_test tests[] = {
},
.result = ACCEPT,
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+   .retval = 1,
},
{
"direct packet access: test12 (and, good access)",
@@ -2802,6 +2814,7 @@ static struct bpf_test tests[] = {
},
.result = ACCEPT,
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+   .retval = 1,
},
{
"direct packet access: test13 (branches, good access)",
@@ -2832,6 +2845,7 @@ static struct bpf_test tests[] = {
},
.result = ACCEPT,
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+   .retval = 1,
},
{
"direct packet access: test14 (pkt_ptr += 0, CONST_IMM, good 
access)",
@@ -2855,6 +2869,7 @@ static struct bpf_test tests[] = {
},
.result = ACCEPT,
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+   .retval = 1,
},
{
"direct packet access: test15 (spill with xadd)",
@@ -3141,6 +3156,7 @@ static struct bpf_test tests[] = {
},
.result = ACCEPT,
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+   .retval = 1,
},
{
"direct packet access: test28 (marking on <=, bad access)",
@@ -5758,6 +5774,7 @@ static struct bpf_test tests[] = {
},
.result = ACCEPT,
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+   .retval = 0 /* csum_diff of 64-byte packet */,
},
{
"helper access to variable memory: size = 0 not allow

Re: [bpf-next PATCH] samples/bpf: xdp2skb_meta comment explain why pkt-data pointers are invalidated

2018-01-17 Thread Daniel Borkmann
On 01/17/2018 11:17 AM, Jesper Dangaard Brouer wrote:
> Improve the 'unknown reason' comment, with an actual explaination of why
> the ctx pkt-data pointers need to be loaded after the helper function
> bpf_xdp_adjust_meta().  Based on the explaination Daniel gave.
> 
> Fixes: 36e04a2d78d9 ("samples/bpf: xdp2skb_meta shows transferring info from 
> XDP to SKB")
> Reported-by: Daniel Borkmann 
> Signed-off-by: Jesper Dangaard Brouer 

Applied to bpf-next, thanks for following up on this, Jesper!


Re: [PATCH bpf-next 0/3] bpf: add dumping and disassembler for non-host JITs

2018-01-17 Thread Daniel Borkmann
On 01/17/2018 01:05 AM, Jakub Kicinski wrote:
> Hi,
> 
> Currently bpftool could disassemble host jited image, for example x86_64,
> using libbfd. However it couldn't disassemble offload jited image.
> 
> There are two reasons:
> 
>   1. bpf_obj_get_info_by_fd/struct bpf_prog_info couldn't get the address
>  of jited image and image's length.
> 
>   2. Even after issue 1 resolved, bpftool couldn't figure out what is the
>  offload arch from bpf_prog_info, therefore can't drive libbfd
>  disassembler correctly.
> 
>   This patch set resolve issue 1 by introducing two new fields "jited_len"
> and "jited_image" in bpf_dev_offload. These two fields serve as the generic
> interface to communicate the jited image address and length for all offload
> targets to higher level caller. For example, bpf_obj_get_info_by_fd could
> use them to fill the userspace visible fields jited_prog_len and
> jited_prog_insns.
> 
>   This patch set resolve issue 2 by getting bfd backend name through
> "ifindex", i.e network interface index.
> 
> v1:
>  - Deduct bfd arch name through ifindex, i.e network interface index.
>First, map ifindex to devname through ifindex_to_name_ns, then get
>pci id through /sys/class/dev/DEVNAME/device/vendor. (Daniel, Alexei)

Looks good, series applied to bpf-next, thanks guys!


Re: [PATCH bpf v2] bpf: mark dst unknown on inconsistent {s,u}bounds adjustments

2018-01-17 Thread Alexei Starovoitov
On Thu, Jan 18, 2018 at 01:15:21AM +0100, Daniel Borkmann wrote:
> syzkaller generated a BPF proglet and triggered a warning with
> the following:
> 
>   0: (b7) r0 = 0
>   1: (d5) if r0 s<= 0x0 goto pc+0
>R0=inv0 R1=ctx(id=0,off=0,imm=0) R10=fp0
>   2: (1f) r0 -= r1
>R0=inv0 R1=ctx(id=0,off=0,imm=0) R10=fp0
>   verifier internal error: known but bad sbounds
> 
> What happens is that in the first insn, r0's min/max value
> are both 0 due to the immediate assignment, later in the jsle
> test the bounds are updated for the min value in the false
> path, meaning, they yield smin_val = 1, smax_val = 0, and when
> ctx pointer is subtracted from r0, verifier bails out with the
> internal error and throwing a WARN since smin_val != smax_val
> for the known constant.
> 
> For min_val > max_val scenario it means that reg_set_min_max()
> and reg_set_min_max_inv() (which both refine existing bounds)
> demonstrated that such branch cannot be taken at runtime.
> 
> In above scenario for the case where it will be taken, the
> existing [0, 0] bounds are kept intact. Meaning, the rejection
> is not due to a verifier internal error, and therefore the
> WARN() is not necessary either.
> 
> We could just reject such cases in adjust_{ptr,scalar}_min_max_vals()
> when either known scalars have smin_val != smax_val or
> umin_val != umax_val or any scalar reg with bounds
> smin_val > smax_val or umin_val > umax_val. However, there
> may be a small risk of breakage of buggy programs, so handle
> this more gracefully and in adjust_{ptr,scalar}_min_max_vals()
> just taint the dst reg as unknown scalar when we see ops with
> such kind of src reg.
> 
> Reported-by: syzbot+6d362cadd45dc0a12...@syzkaller.appspotmail.com
> Signed-off-by: Daniel Borkmann 

there are several ways to fix it and this approach looks
to be the safest way to do it since we're so late into the release.
Applied, Thank you Daniel.



Re: [PATCH RFC net-next 6/6] rds: zerocopy Tx support.

2018-01-17 Thread Willem de Bruijn
On Wed, Jan 17, 2018 at 7:20 AM, Sowmini Varadhan
 wrote:
> If the MSG_ZEROCOPY flag is specified with rds_sendmsg(), and,
> if the SO_ZEROCOPY socket option has been set on the PF_RDS socket,
> application pages sent down with rds_sendmsg() are pinned.
>
> The pinning uses the accounting infrastructure added by
> Commit a91dbff551a6 ("sock: ulimit on MSG_ZEROCOPY pages")
>
> The payload bytes in the message may not be modified for the
> duration that the message has been pinned. A multi-threaded
> application using this infrastructure may thus need to be notified
> about send-completion so that it can free/reuse the buffers
> passed to rds_sendmsg(). Notification of send-completion will
> identify each message-buffer by a cookie that the application
> must specify as ancillary data to rds_sendmsg().
> The ancillary data in this case has cmsg_level == SOL_RDS
> and cmsg_type == RDS_CMSG_ZCOPY_COOKIE.
>
> Signed-off-by: Sowmini Varadhan 
> ---
>  include/uapi/linux/rds.h |1 +
>  net/rds/message.c|   44 +++-
>  net/rds/rds.h|3 ++-
>  net/rds/send.c   |   27 ---
>  4 files changed, 70 insertions(+), 5 deletions(-)

> +static int rds_cmsg_zcopy(struct rds_sock *rs, struct rds_message *rm,
> + struct cmsghdr *cmsg)
> +{
> +   unsigned int *cookie;

Use fixed-width types across the ABI.

> +
> +   if (cmsg->cmsg_len < CMSG_LEN(sizeof(*cookie)))
> +   return -EINVAL;
> +   cookie = CMSG_DATA(cmsg);
> +   rm->data.op_mmp_znotifier->z_cookie = *cookie;
> +   return 0;
> +}
> +

> @@ -1107,12 +1126,14 @@ int rds_sendmsg(struct socket *sock, struct msghdr 
> *msg, size_t payload_len)
>
> /* Attach data to the rm */
> if (payload_len) {
> -   rm->data.op_sg = rds_message_alloc_sgs(rm, ceil(payload_len, 
> PAGE_SIZE));
> +   int num_sgs = ceil(payload_len, PAGE_SIZE);
> +
> +   rm->data.op_sg = rds_message_alloc_sgs(rm, num_sgs);

Unrelated change. Leftover from revising the patch?

Also, with user pages, data is no longer guaranteed to be page
aligned, so this may
need more sgs.


Re: [PATCH net-next] net: core: Expose number of link up/down transitions

2018-01-17 Thread David Ahern
On 1/17/18 3:06 PM, Florian Fainelli wrote:
> diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
> index 7bf8b85ade16..9f732c3dc2ce 100644
> --- a/net/core/net-sysfs.c
> +++ b/net/core/net-sysfs.c
> @@ -295,10 +295,29 @@ static ssize_t carrier_changes_show(struct device *dev,
>   struct net_device *netdev = to_net_dev(dev);
>  
>   return sprintf(buf, fmt_dec,
> -atomic_read(&netdev->carrier_changes));
> +atomic_read(&netdev->count_link_up) +
> +atomic_read(&netdev->count_link_down));
>  }
>  static DEVICE_ATTR_RO(carrier_changes);
>  
> +static ssize_t count_link_up_show(struct device *dev,
> +   struct device_attribute *attr, char *buf)
> +{
> + struct net_device *netdev = to_net_dev(dev);
> +
> + return sprintf(buf, fmt_dec, atomic_read(&netdev->count_link_up));
> +}
> +static DEVICE_ATTR_RO(count_link_up);
> +
> +static ssize_t count_link_down_show(struct device *dev,
> + struct device_attribute *attr, char *buf)
> +{
> + struct net_device *netdev = to_net_dev(dev);
> +
> + return sprintf(buf, fmt_dec, atomic_read(&netdev->count_link_down));
> +}
> +static DEVICE_ATTR_RO(count_link_down);
> +
>  /* read-write attributes */
>  
>  static int change_mtu(struct net_device *dev, unsigned long new_mtu)
> @@ -547,6 +566,8 @@ static struct attribute *net_class_attrs[] 
> __ro_after_init = {
>   &dev_attr_phys_port_name.attr,
>   &dev_attr_phys_switch_id.attr,
>   &dev_attr_proto_down.attr,
> + &dev_attr_count_link_up.attr,
> + &dev_attr_count_link_down.attr,
>   NULL,
>  };
>  ATTRIBUTE_GROUPS(net_class);

Personally, I do not like adding any more sysfs files. As discussed in
the past sysfs is a huge contributor to the overhead of netdevs.


Re: [RFC v2 net-next 02/10] net: ipv4: raw: Hook into time based transmission.

2018-01-17 Thread Eric Dumazet
On Wed, 2018-01-17 at 15:06 -0800, Jesus Sanchez-Palencia wrote:
> From: Richard Cochran 
> 
> For raw packets, copy the desired future transmit time from the CMSG
> cookie into the skb.
> 
> Signed-off-by: Richard Cochran 
> Signed-off-by: Jesus Sanchez-Palencia 
> ---
>  net/ipv4/raw.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
> index 136544b36a46..e37ea8ab6a78 100644
> --- a/net/ipv4/raw.c
> +++ b/net/ipv4/raw.c
> @@ -381,6 +381,7 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 
> *fl4,
>  
>   skb->priority = sk->sk_priority;
>   skb->mark = sk->sk_mark;
> + skb->tstamp = sockc->transmit_time;
>   skb_dst_set(skb, &rt->dst);
>   *rtp = NULL;
>  
> @@ -562,6 +563,7 @@ static int raw_sendmsg(struct sock *sk, struct msghdr 
> *msg, size_t len)
>   }
>  
>   ipc.sockc.tsflags = sk->sk_tsflags;
> + ipc.sockc.transmit_time = 0;
>   ipc.addr = inet->inet_saddr;
>   ipc.opt = NULL;
>   ipc.tx_flags = 0;


It seems that skb_scrub_packet() will clear skb->tstamp, meaning  that
going through a tunnel will break your feature.

Maybe we need to remove skb->tstamp clear from skb_scrub_packet() and
do the cleaning only in forwarding path.




Re: [PATCH RFC net-next 5/6] rds: support for zcopy completion notification

2018-01-17 Thread Willem de Bruijn
On Wed, Jan 17, 2018 at 7:20 AM, Sowmini Varadhan
 wrote:
> RDS removes a datagram from the retransmit queue when an ACK is
> received. The ACK indicates that the receiver has queued the
> RDS datagram, so that the sender can safely forget the datagram.
>
> If the datagram to be removed had pinned pages set up, add
> an entry to the rs->rs_znotify_queue so that the notifcation
> will be sent up via rds_rm_zerocopy_callback() when the
> rds_message is eventually freed by rds_message_purge.
>
> Signed-off-by: Sowmini Varadhan 
> ---

> +static void rds_rm_zerocopy_callback(struct rds_sock *rs)
> +{
> +   struct sock *sk = rds_rs_to_sk(rs);
> +   struct sk_buff *skb;
> +   struct sock_exterr_skb *serr;
> +   struct sk_buff_head *q;
> +   unsigned long flags;
> +   struct sk_buff *tail;
> +   u32 *ptr;
> +   int ncookies = 0, i;
> +   struct rds_znotifier *znotif, *ztmp;
> +   LIST_HEAD(tmp_list);
> +
> +   spin_lock_irqsave(&rs->rs_lock, flags);
> +   list_splice(&rs->rs_znotify_queue, &tmp_list);
> +   INIT_LIST_HEAD(&rs->rs_znotify_queue);
> +   spin_unlock_irqrestore(&rs->rs_lock, flags);
> +
> +   list_for_each_entry_safe(znotif, ztmp, &tmp_list, z_list)
> +   ncookies++;
> +   if (ncookies == 0)
> +   return;
> +   skb = alloc_skb(ncookies * sizeof(u32), GFP_ATOMIC);
> +   if (!skb) {
> +   spin_lock_irqsave(&rs->rs_lock, flags);
> +   list_splice(&tmp_list, &rs->rs_znotify_queue);
> +   spin_unlock_irqrestore(&rs->rs_lock, flags);
> +   return;
> +   }

TCP zerocopy avoids this issue by allocating the notification skb when the
zerocopy packet is created, which would be rds_message_copy_from_user.

This does not add an allocation, if using the same trick of stashing
the intermediate notification object (op_mmp_znotifier) in skb->cb. Though,
alloc_skb is probably more expensive than that kzalloc. If nothing else,
because of more zeroing.

> +   serr = SKB_EXT_ERR(skb);
> +   serr->ee.ee_errno = 0;
> +   serr->ee.ee_origin = SO_EE_ORIGIN_ZEROCOPY;
> +   serr->ee.ee_data = ncookies;

This changes the semantics of these fields. Please add a new SO_EE_CODE flag,
even if the semantics can be derived from the packet family (for now).

Even better would be if we can avoid the cookies completely. I understand
the issue with concurrent send threads racing on obtaining a zckey value. If
the sender could learn which zckey was chosen for a call, would that suffice?

I suspect that in even with concurrent senders, notifications arrive largely in
order, in which case we could just maintain the existing semantics and even
reuse that implementation.

> +   serr->ee.ee_info = 0;
> +   serr->ee.ee_code |= SO_EE_CODE_ZEROCOPY_COPIED;

Why set the copied bit here?

> +   ptr = skb_put(skb, ncookies * sizeof(u32));
> +
> +   i = 0;
> +   list_for_each_entry_safe(znotif, ztmp, &tmp_list, z_list) {
> +   list_del(&znotif->z_list);
> +   ptr[i++] = znotif->z_cookie;
> +   mm_unaccount_pinned_pages(&znotif->z_mmp);
> +   kfree(znotif);
> +   }
> +   WARN_ON(!list_empty(&tmp_list));
> +   q = &sk->sk_error_queue;
> +   spin_lock_irqsave(&q->lock, flags);
> +   tail = skb_peek_tail(q);
> +   if (!tail ||
> +   SKB_EXT_ERR(tail)->ee.ee_origin != SO_EE_ORIGIN_ZEROCOPY)  {
> +   __skb_queue_tail(q, skb);
> +   skb = NULL;
> +   }

This drops the packet if the branch is not taken. In the TCP case this condition
means that we can try to coalesce packets, but that is not the case here.

> +   spin_unlock_irqrestore(&q->lock, flags);
> +   sk->sk_error_report(sk);
> +   consume_skb(skb);
> +}
> +
>  /*
>   * This relies on dma_map_sg() not touching sg[].page during merging.
>   */
> @@ -66,11 +127,15 @@ static void rds_message_purge(struct rds_message *rm)
> for (i = 0; i < rm->data.op_nents; i++) {
> rdsdebug("putting data page %p\n", (void 
> *)sg_page(&rm->data.op_sg[i]));
> /* XXX will have to put_page for page refs */
> -   __free_page(sg_page(&rm->data.op_sg[i]));
> +   if (!rm->data.op_zcopy)
> +   __free_page(sg_page(&rm->data.op_sg[i]));
> +   else
> +   put_page(sg_page(&rm->data.op_sg[i]));
> }
> rm->data.op_nents = 0;
> spin_lock_irqsave(&rm->m_rs_lock, flags);
> if (rm->m_rs) {
> +   rds_rm_zerocopy_callback(rm->m_rs);
> sock_put(rds_rs_to_sk(rm->m_rs));
> rm->m_rs = NULL;
> }
> diff --git a/net/rds/rds.h b/net/rds/rds.h
> index 374ae83..de5015a 100644
> --- a/net/rds/rds.h
> +++ b/net/rds/rds.h
> @@ -356,6 +356,12 @@ static inline u32 
> rds_rdma_cookie_offset(rds_rdma_cookie_t cookie)
>  #define RDS_MSG_PAGEVEC7
>

Re: [PATCH net-next] net: core: Expose number of link up/down transitions

2018-01-17 Thread Andrei Vagin
On Thu, Jan 18, 2018 at 01:06:52AM +0100, Andrew Lunn wrote:
> > What is the idea to have two separate counters? Can a delta between them
> > be a bigger than 1?
> 
> Yes, it can.
> 
> These counters are incremented in netif_carrier_on() /
> netif_carrier_off(). They are not always called in pairs and they can
> be called multiple times for the same event. The phylib will call them
> when it notices the PHY saying the link is down/up, and the MAC driver
> sometimes also calls them.

We check the __LINK_STATE_NOCARRIER bit before changing these counters,
so if we call netif_carrier_on() twice, the counter will be incrimented
only by one, will it not?

void netif_carrier_on(struct net_device *dev)
if (test_and_clear_bit(__LINK_STATE_NOCARRIER, &dev->state)) {
atomic_inc(&dev->carrier_changes);

...

void netif_carrier_off(struct net_device *dev)
if (!test_and_set_bit(__LINK_STATE_NOCARRIER, &dev->state)) {
atomic_inc(&dev->carrier_changes);


> 
> Andrew


Re: dangers of bots on the mailing lists was Re: divide error in ___bpf_prog_run

2018-01-17 Thread Alexei Starovoitov
On Wed, Jan 17, 2018 at 03:47:35PM -0500, Theodore Ts'o wrote:
> On Wed, Jan 17, 2018 at 12:09:18PM +0100, Dmitry Vyukov wrote:
> > On Wed, Jan 17, 2018 at 10:49 AM, Daniel Borkmann  
> > wrote:
> > > Don't know if there's such a possibility, but it would be nice if we could
> > > target fuzzing for specific subsystems in related subtrees directly (e.g.
> > > for bpf in bpf and bpf-next trees as one example). Dmitry?
> > 
> > Hi Daniel,
> > 
> > It's doable.
> > Let's start with one bpf tree. Will it be bpf or bpf-next? Which one
> > contains more ongoing work? What's the exact git repo address/branch,
> > so that I don't second guess?
> 
> As a suggestion, until the bpf subsystem is free from problems that
> can be found by Syzkaller in Linus's upstream tree, maybe it's not
> worth trying to test individual subsystem trees such as the bpf tree?
> After all, there's no point trying to bisect our way checking to see
> if the problem is with a newly added commit in a development tree, if
> it turns out the problem was first introduced years ago in the 4.1 or
> 3.19 timeframe.
> 
> After all, finding these older problems is going to have much higher
> value, since these are the sorts of potential security problems that
> are worth backporting to real device kernels for Android/ChromeOS, and
> for enterprise distro kernels.  So from an "impact to the industry"
> perspective, focusing on Linus's tree is going to be far more
> productive.  That's a win for the community, and it's a win for those
> people on the Syzkaller team who might be going up for promo or
> listing their achievements at performance review time.  :-)

all correct, but if there is capacity in syzkaller server farm
to test bpf and bpf-next trees it will be huge win for everyone as well.
For example in the recent speculation fix we missed integer overflow case
and it was found by syzkaller only when the patches landed in net tree.
We did quick follow up patch, but it caused double work for
us and all stable maintainers.
I think finding bugs in the development trees is just as important
as bugs in Linus's tree, since it improves quality of
patches before they reach mainline.

If syzkaller can only test one tree than linux-next should be the one.

There is some value of testing stable trees, but any developer
will first ask for a reproducer in the latest, so usefulness of
reporting such bugs will be limited.



Re: [PATCH net-next] net: core: Expose number of link up/down transitions

2018-01-17 Thread David Ahern
On 1/17/18 3:52 PM, Jiri Pirko wrote:
> Thu, Jan 18, 2018 at 12:06:57AM CET, f.faine...@gmail.com wrote:
>> From: David Decotigny 
>>
>> Expose the number of times the link has been going UP or DOWN, and
>> update the "carrier_changes" counter to be the sum of these two events.
>> While at it, also update the sysfs-class-net documentation to cover:
>> carrier_changes (3.15), count_link_up (4.16) and count_link_down (4.16)
>>
>> Signed-off-by: David Decotigny 
> 
> [...]
> 
> 
>> @@ -161,6 +161,8 @@ enum {
>>  IFLA_EVENT,
>>  IFLA_NEW_NETNSID,
>>  IFLA_IF_NETNSID,
>> +IFLA_COUNT_LINK_UP,
>> +IFLA_COUNT_LINK_DOWN,
> 
> 
>   IFLA_LINK_UP_COUNT,
>   IFLA_LINK_DOWN_COUNT,
> 
> would sound a bit nicer to me.

given existing IFLA_LINK_* attributes how about
IFLA_CARRIER_{UP,DOWN}_COUNT?


[PATCH bpf v2] bpf: mark dst unknown on inconsistent {s,u}bounds adjustments

2018-01-17 Thread Daniel Borkmann
syzkaller generated a BPF proglet and triggered a warning with
the following:

  0: (b7) r0 = 0
  1: (d5) if r0 s<= 0x0 goto pc+0
   R0=inv0 R1=ctx(id=0,off=0,imm=0) R10=fp0
  2: (1f) r0 -= r1
   R0=inv0 R1=ctx(id=0,off=0,imm=0) R10=fp0
  verifier internal error: known but bad sbounds

What happens is that in the first insn, r0's min/max value
are both 0 due to the immediate assignment, later in the jsle
test the bounds are updated for the min value in the false
path, meaning, they yield smin_val = 1, smax_val = 0, and when
ctx pointer is subtracted from r0, verifier bails out with the
internal error and throwing a WARN since smin_val != smax_val
for the known constant.

For min_val > max_val scenario it means that reg_set_min_max()
and reg_set_min_max_inv() (which both refine existing bounds)
demonstrated that such branch cannot be taken at runtime.

In above scenario for the case where it will be taken, the
existing [0, 0] bounds are kept intact. Meaning, the rejection
is not due to a verifier internal error, and therefore the
WARN() is not necessary either.

We could just reject such cases in adjust_{ptr,scalar}_min_max_vals()
when either known scalars have smin_val != smax_val or
umin_val != umax_val or any scalar reg with bounds
smin_val > smax_val or umin_val > umax_val. However, there
may be a small risk of breakage of buggy programs, so handle
this more gracefully and in adjust_{ptr,scalar}_min_max_vals()
just taint the dst reg as unknown scalar when we see ops with
such kind of src reg.

Reported-by: syzbot+6d362cadd45dc0a12...@syzkaller.appspotmail.com
Signed-off-by: Daniel Borkmann 
---
 kernel/bpf/verifier.c   |  27 +++---
 tools/testing/selftests/bpf/test_verifier.c | 123 +++-
 2 files changed, 138 insertions(+), 12 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index eb062b0..13551e6 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1895,17 +1895,13 @@ static int adjust_ptr_min_max_vals(struct 
bpf_verifier_env *env,
 
dst_reg = ®s[dst];
 
-   if (WARN_ON_ONCE(known && (smin_val != smax_val))) {
-   print_verifier_state(env, env->cur_state);
-   verbose(env,
-   "verifier internal error: known but bad sbounds\n");
-   return -EINVAL;
-   }
-   if (WARN_ON_ONCE(known && (umin_val != umax_val))) {
-   print_verifier_state(env, env->cur_state);
-   verbose(env,
-   "verifier internal error: known but bad ubounds\n");
-   return -EINVAL;
+   if ((known && (smin_val != smax_val || umin_val != umax_val)) ||
+   smin_val > smax_val || umin_val > umax_val) {
+   /* Taint dst register if offset had invalid bounds derived from
+* e.g. dead branches.
+*/
+   __mark_reg_unknown(dst_reg);
+   return 0;
}
 
if (BPF_CLASS(insn->code) != BPF_ALU64) {
@@ -2097,6 +2093,15 @@ static int adjust_scalar_min_max_vals(struct 
bpf_verifier_env *env,
src_known = tnum_is_const(src_reg.var_off);
dst_known = tnum_is_const(dst_reg->var_off);
 
+   if ((src_known && (smin_val != smax_val || umin_val != umax_val)) ||
+   smin_val > smax_val || umin_val > umax_val) {
+   /* Taint dst register if offset had invalid bounds derived from
+* e.g. dead branches.
+*/
+   __mark_reg_unknown(dst_reg);
+   return 0;
+   }
+
if (!src_known &&
opcode != BPF_ADD && opcode != BPF_SUB && opcode != BPF_AND) {
__mark_reg_unknown(dst_reg);
diff --git a/tools/testing/selftests/bpf/test_verifier.c 
b/tools/testing/selftests/bpf/test_verifier.c
index 67e7c41..5ed4175 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -6732,7 +6732,7 @@ static struct bpf_test tests[] = {
BPF_JMP_IMM(BPF_JA, 0, 0, -7),
},
.fixup_map1 = { 4 },
-   .errstr = "unbounded min value",
+   .errstr = "R0 invalid mem access 'inv'",
.result = REJECT,
},
{
@@ -8634,6 +8634,127 @@ static struct bpf_test tests[] = {
.flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
+   "check deducing bounds from const, 1",
+   .insns = {
+   BPF_MOV64_IMM(BPF_REG_0, 1),
+   BPF_JMP_IMM(BPF_JSGE, BPF_REG_0, 1, 0),
+   BPF_ALU64_REG(BPF_SUB, BPF_REG_0, BPF_REG_1),
+   BPF_EXIT_INSN(),
+   },
+   .result = REJECT,
+   .errstr = "R0 tried to subtract pointer from scalar",
+   },
+   {
+   "check deducing bounds from const, 2",
+   .insns = {
+   BPF_MOV64_IMM(BPF_REG

Re: [RFC PATCH net-next 02/12] selftests: forwarding: Add a test for FDB learning

2018-01-17 Thread David Ahern
On 1/17/18 3:31 PM, Jiri Pirko wrote:
> Wed, Jan 17, 2018 at 11:59:10PM CET, ro...@cumulusnetworks.com wrote:
>> On Wed, Jan 17, 2018 at 2:46 PM, Roopa Prabhu  
>> wrote:
>>> On Wed, Jan 17, 2018 at 1:01 PM, Jiri Pirko  wrote:
 Wed, Jan 17, 2018 at 09:48:54PM CET, dsah...@gmail.com wrote:
> On 1/15/18 11:18 AM, Ido Schimmel wrote:
>> diff --git a/tools/testing/selftests/forwarding/lib.sh 
>> b/tools/testing/selftests/forwarding/lib.sh
>> index bb423371f4de..264bf0af4c4d 100644
>> --- a/tools/testing/selftests/forwarding/lib.sh
>> +++ b/tools/testing/selftests/forwarding/lib.sh
>> @@ -22,6 +22,11 @@ if [[ ! -x "$(command -v jq)" ]]; then
>>  exit 0
>>  fi
>>
>> +if [[ ! -x "$(command -v mausezahn)" ]]; then
>> +echo "SKIP: mausezahn not installed"
>> +exit 0
>> +fi
>> +
>
> The checks are good, but hitting a collision with this one:
> $   ./router.sh
> SKIP: mausezahn not installed
>
> For debian, it is mz not mausezahn.

 That is weird. If you look at the sources, the binary name is
 "mausezahn". Looks like debian is doing some renaming :/

>>>
>>> I have seen/used both versions. Debian packages from some old upstream
>>> source which calls it mz.
>>> https://packages.debian.org/sid/main/mz
>>>
>>> I have used latest mz (with ipv6 support etc) from netsniff-ng which
>>> builds it as mausezahn.
>>
>>
>> actually Debian also builds
>> https://packages.debian.org/sid/netsniff-ng and installs 'mausezahn'
>>
>> so you might have to check and use which exists... (mausezahn overriding mz).
> 
> I believe we can stick with mausezahn, since it is apparently
> available on Debian too.
> 

How about the attached - make ping, ping6 and mz variables? Default to
ping and mausezahn but let users update that name if necessary. Comments
in the sample config to the effect need to be added to help users.
diff --git a/tools/testing/selftests/forwarding/bridge.sh 
b/tools/testing/selftests/forwarding/bridge.sh
index 7ae4f1639c45..6a34489184ea 100755
--- a/tools/testing/selftests/forwarding/bridge.sh
+++ b/tools/testing/selftests/forwarding/bridge.sh
@@ -101,13 +101,13 @@ trap cleanup EXIT
 setup_prepare
 setup_wait
 
-ping_test "vrf-h1" 192.0.2.2
-ping_test "vrf-h1" 2001:db8:1::2
+ping_test ${PING} "vrf-h1" 192.0.2.2
+ping_test ${PING6} "vrf-h1" 2001:db8:1::2
 
 old_mtu=$(mtu_get $h1)
 mtu_change 9000 "${netifs_arr[@]}"
-ping_test "vrf-h1" 192.0.2.2
-ping_test "vrf-h1" 2001:db8:1::2
+ping_test ${PING} "vrf-h1" 192.0.2.2
+ping_test ${PING6} "vrf-h1" 2001:db8:1::2
 mtu_change $old_mtu "${netifs_arr[@]}"
 
 learning_test "br0" $swp1 1 $h1
diff --git a/tools/testing/selftests/forwarding/forwarding.config.sample 
b/tools/testing/selftests/forwarding/forwarding.config.sample
index f2b14814e4ba..78262e3b7523 100644
--- a/tools/testing/selftests/forwarding/forwarding.config.sample
+++ b/tools/testing/selftests/forwarding/forwarding.config.sample
@@ -17,3 +17,7 @@ declare -A OPTIONS
 
 # Time to wait after interfaces participating in the test are all UP.
 OPTIONS[wait_time]=5
+
+#PING=ping
+#PING6=ping
+#MZ=mausezahn
diff --git a/tools/testing/selftests/forwarding/lib.sh 
b/tools/testing/selftests/forwarding/lib.sh
index b9b049af93cb..774f5c68ad0b 100644
--- a/tools/testing/selftests/forwarding/lib.sh
+++ b/tools/testing/selftests/forwarding/lib.sh
@@ -22,8 +22,8 @@ if [[ ! -x "$(command -v jq)" ]]; then
exit 0
 fi
 
-if [[ ! -x "$(command -v mausezahn)" ]]; then
-   echo "SKIP: mausezahn not installed"
+if [[ ! -x "$(command -v ${MZ})" ]]; then
+   echo "SKIP: ${MZ} not installed"
exit 0
 fi
 
@@ -242,12 +242,13 @@ tc_offload_check()
 
 ping_test()
 {
-   local vrf_name=$1
-   local dip=$2
+   local ping=$1
+   local vrf_name=$2
+   local dip=$3
 
RET=0
 
-   ip vrf exec $vrf_name ping $dip -c 10 -i 0.1 -w 2 &> /dev/null
+   ip vrf exec $vrf_name $ping $dip -c 10 -i 0.1 -w 2 &> /dev/null
check_err $?
print_result "ping"
 }
@@ -266,7 +267,7 @@ learning_test()
| jq -e '.[] | select(.mac == "de:ad:be:ef:13:37")' &> /dev/null
check_fail $? "found FDB record when should not"
 
-   mausezahn $host_if -c 1 -p 64 -a de:ad:be:ef:13:37 -t ip -q
+   ${MZ} $host_if -c 1 -p 64 -a de:ad:be:ef:13:37 -t ip -q
 
bridge -j fdb show br $bridge brport $br_port vlan $vid \
| jq -e '.[] | select(.mac == "de:ad:be:ef:13:37")' &> /dev/null
@@ -283,7 +284,7 @@ learning_test()
 
bridge link set dev $br_port learning off
 
-   mausezahn $host_if -c 1 -p 64 -a de:ad:be:ef:13:37 -t ip -q
+   ${MZ} $host_if -c 1 -p 64 -a de:ad:be:ef:13:37 -t ip -q
 
bridge -j fdb show br $bridge brport $br_port vlan $vid \
| jq -e '.[] | select(.mac == "de:ad:be:ef:13:37")' &> /dev/null
@@ -309,7 +310,7 @@ flood_test_do()
tc filter add dev $host2_if ingress protocol ip pref 1 handle 101 \
 

Re: remove pci_dma_* abuses and workarounds V2

2018-01-17 Thread Bjorn Helgaas
[+cc David]

On Wed, Jan 10, 2018 at 07:03:18PM +0100, Christoph Hellwig wrote:
> Back before the dawn of time pci_dma_* with a NULL pci_dev argument
> was used for all kinds of things, e.g. dma mapping for non-PCI
> devices.  All this has been long removed, but it turns out we
> still care for a NULL pci_dev in the wrappers, and we still have
> two odd USB drivers that use pci_dma_alloc_consistent for allocating
> memory while ignoring the dma_addr_t entirely, and a network driver
> mixing the already wrong usage of dma_* with a NULL device with a
> single call to pci_free_consistent.
> 
> This series switches the two usb drivers to use plain kzalloc, the
> net driver to properly use the dma API and then removes the handling
> of the NULL pci_dev in the pci_dma_* wrappers.
> 
> Changes since V1:
>  - remove allocation failure printks
>  - use kcalloc
>  - fix tsi108_eth
>  - improve changelogs

Applied to pci/dma for v4.16, thanks!


Re: [PATCH 3/4] tsi108_eth: use dma API properly

2018-01-17 Thread Bjorn Helgaas
[+cc David, FYI, I plan to merge this via PCI along with the rest of
Christoph's series]

On Wed, Jan 10, 2018 at 07:03:21PM +0100, Christoph Hellwig wrote:
> We need to pass a struct device to the dma API, even if some
> architectures still support that for legacy reasons, and should not mix
> it with the old PCI dma API.
> 
> Note that the driver also seems to never actually unmap its dma mappings,
> but to fix that we'll need someone more familar with the driver.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/net/ethernet/tundra/tsi108_eth.c | 36 
> ++--
>  1 file changed, 20 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/net/ethernet/tundra/tsi108_eth.c 
> b/drivers/net/ethernet/tundra/tsi108_eth.c
> index 0624b71ab5d4..edcd1e60b30d 100644
> --- a/drivers/net/ethernet/tundra/tsi108_eth.c
> +++ b/drivers/net/ethernet/tundra/tsi108_eth.c
> @@ -152,6 +152,8 @@ struct tsi108_prv_data {
>   u32 msg_enable; /* debug message level */
>   struct mii_if_info mii_if;
>   unsigned int init_media;
> +
> + struct platform_device *pdev;
>  };
>  
>  /* Structure for a device driver */
> @@ -703,17 +705,18 @@ static int tsi108_send_packet(struct sk_buff * skb, 
> struct net_device *dev)
>   data->txskbs[tx] = skb;
>  
>   if (i == 0) {
> - data->txring[tx].buf0 = dma_map_single(NULL, skb->data,
> - skb_headlen(skb), DMA_TO_DEVICE);
> + data->txring[tx].buf0 = dma_map_single(&data->pdev->dev,
> + skb->data, skb_headlen(skb),
> + DMA_TO_DEVICE);
>   data->txring[tx].len = skb_headlen(skb);
>   misc |= TSI108_TX_SOF;
>   } else {
>   const skb_frag_t *frag = &skb_shinfo(skb)->frags[i - 1];
>  
> - data->txring[tx].buf0 = skb_frag_dma_map(NULL, frag,
> -  0,
> -  
> skb_frag_size(frag),
> -  DMA_TO_DEVICE);
> + data->txring[tx].buf0 =
> + skb_frag_dma_map(&data->pdev->dev, frag,
> + 0, skb_frag_size(frag),
> + DMA_TO_DEVICE);
>   data->txring[tx].len = skb_frag_size(frag);
>   }
>  
> @@ -808,9 +811,9 @@ static int tsi108_refill_rx(struct net_device *dev, int 
> budget)
>   if (!skb)
>   break;
>  
> - data->rxring[rx].buf0 = dma_map_single(NULL, skb->data,
> - TSI108_RX_SKB_SIZE,
> - DMA_FROM_DEVICE);
> + data->rxring[rx].buf0 = dma_map_single(&data->pdev->dev,
> + skb->data, TSI108_RX_SKB_SIZE,
> + DMA_FROM_DEVICE);
>  
>   /* Sometimes the hardware sets blen to zero after packet
>* reception, even though the manual says that it's only ever
> @@ -1308,15 +1311,15 @@ static int tsi108_open(struct net_device *dev)
>  data->id, dev->irq, dev->name);
>   }
>  
> - data->rxring = dma_zalloc_coherent(NULL, rxring_size, &data->rxdma,
> -GFP_KERNEL);
> + data->rxring = dma_zalloc_coherent(&data->pdev->dev, rxring_size,
> + &data->rxdma, GFP_KERNEL);
>   if (!data->rxring)
>   return -ENOMEM;
>  
> - data->txring = dma_zalloc_coherent(NULL, txring_size, &data->txdma,
> -GFP_KERNEL);
> + data->txring = dma_zalloc_coherent(&data->pdev->dev, txring_size,
> + &data->txdma, GFP_KERNEL);
>   if (!data->txring) {
> - pci_free_consistent(NULL, rxring_size, data->rxring,
> + dma_free_coherent(&data->pdev->dev, rxring_size, data->rxring,
>   data->rxdma);
>   return -ENOMEM;
>   }
> @@ -1428,10 +1431,10 @@ static int tsi108_close(struct net_device *dev)
>   dev_kfree_skb(skb);
>   }
>  
> - dma_free_coherent(0,
> + dma_free_coherent(&data->pdev->dev,
>   TSI108_RXRING_LEN * sizeof(rx_desc),
>   data->rxring, data->rxdma);
> - dma_free_coherent(0,
> + dma_free_coherent(&data->pdev->dev,
>   TSI108_TXRING_LEN * sizeof(tx_desc),
>   data->txring, data->txdma);
>  
> @@ -1576,6 +1579,7 @@ tsi108_init_one(struct platform_device *pdev)
>   printk("tsi108_eth%d: probe...\n", pdev->id);
>   data = netdev_priv(dev);
>   data->dev = dev;
> + 

Re: [PATCH net-next] net: core: Expose number of link up/down transitions

2018-01-17 Thread Andrew Lunn
> What is the idea to have two separate counters? Can a delta between them
> be a bigger than 1?

Yes, it can.

These counters are incremented in netif_carrier_on() /
netif_carrier_off(). They are not always called in pairs and they can
be called multiple times for the same event. The phylib will call them
when it notices the PHY saying the link is down/up, and the MAC driver
sometimes also calls them.

  Andrew


Re: [PATCH RFC net-next 4/6] sock: permit SO_ZEROCOPY on PF_RDS socket

2018-01-17 Thread Willem de Bruijn
On Wed, Jan 17, 2018 at 7:20 AM, Sowmini Varadhan
 wrote:
> allow the application to set SO_ZEROCOPY on the underlying sk
> of a PF_RDS socket
>
> Signed-off-by: Sowmini Varadhan 
> ---
>  net/core/sock.c |7 +++
>  1 files changed, 7 insertions(+), 0 deletions(-)
>
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 4f52677..f0f44b0 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -1049,6 +1049,13 @@ int sock_setsockopt(struct socket *sock, int level, 
> int optname,
> break;
>
> case SO_ZEROCOPY:
> +   if (sk->sk_family == PF_RDS) {
> +   if (val < 0 || val > 1)
> +   ret = -EINVAL;
> +   else
> +   sock_valbool_flag(sk, SOCK_ZEROCOPY, valbool);
> +   break;
> +   }

Let's integrate this in the existing logic. Perhaps something like

  if (sk->sk_family == PF_INET || sk->sk_family == PF_INET6) {
if (sk->sk_protocol != IPPROTO_TCP)
  ret = -ENOTSUPP;
else if (sk->sk_state != TCP_CLOSE)
  ret = -EBUSY;
  } else if (sk->sk_protocol != PF_RDS) {
ret = -ENOTSUPP;
  }

  if (!ret) {
if (val < 0 || val > 1)
  ret = -EINVAL;
else
  sock_valbool_flag(sk, SOCK_ZEROCOPY, valbool);
  }


Re: [PATCH net-next] net: core: Expose number of link up/down transitions

2018-01-17 Thread Jiri Pirko
Thu, Jan 18, 2018 at 12:06:57AM CET, f.faine...@gmail.com wrote:
>From: David Decotigny 
>
>Expose the number of times the link has been going UP or DOWN, and
>update the "carrier_changes" counter to be the sum of these two events.
>While at it, also update the sysfs-class-net documentation to cover:
>carrier_changes (3.15), count_link_up (4.16) and count_link_down (4.16)
>
>Signed-off-by: David Decotigny 

[...]


>@@ -161,6 +161,8 @@ enum {
>   IFLA_EVENT,
>   IFLA_NEW_NETNSID,
>   IFLA_IF_NETNSID,
>+  IFLA_COUNT_LINK_UP,
>+  IFLA_COUNT_LINK_DOWN,


IFLA_LINK_UP_COUNT,
IFLA_LINK_DOWN_COUNT,

would sound a bit nicer to me.

>   __IFLA_MAX
> };


Re: [PATCH RFC net-next 1/6] sock: MSG_PEEK support for sk_error_queue

2018-01-17 Thread Willem de Bruijn
On Wed, Jan 17, 2018 at 7:19 AM, Sowmini Varadhan
 wrote:
> Allow the application the ability to use MSG_PEEK with sk_error_queue
> so that it can peek and re-read message in cases where MSG_TRUNC
> may be encountered.
>
> Signed-off-by: Sowmini Varadhan 

>  int sock_recv_errqueue(struct sock *sk, struct msghdr *msg, int len,
> -  int level, int type)
> +  int level, int type, int flags)
>  {
> struct sock_exterr_skb *serr;
> struct sk_buff *skb;
> @@ -2916,7 +2916,10 @@ int sock_recv_errqueue(struct sock *sk, struct msghdr 
> *msg, int len,
> err = copied;
>
>  out_free_skb:
> -   kfree_skb(skb);
> +   if (likely(!(flags & MSG_PEEK)))
> +   kfree_skb(skb);
> +   else
> +   skb_queue_head(&sk->sk_error_queue, skb);

This can cause reordering with parallel readers. Can we avoid the need
for peeking? It also caused a slew of subtle bugs previously.

How about just define a max number of cookies and require the caller
to always read with sufficient room to hold them?


Re: [PATCH net-next] net: core: Expose number of link up/down transitions

2018-01-17 Thread Andrei Vagin
On Wed, Jan 17, 2018 at 03:06:57PM -0800, Florian Fainelli wrote:
> From: David Decotigny 
> 
> Expose the number of times the link has been going UP or DOWN, and
> update the "carrier_changes" counter to be the sum of these two events.
> While at it, also update the sysfs-class-net documentation to cover:
> carrier_changes (3.15), count_link_up (4.16) and count_link_down (4.16)

What is the idea to have two separate counters? Can a delta between them
be a bigger than 1?

> 
> Signed-off-by: David Decotigny 
> [Florian:
> * rebase
> * add documentation
> * merge carrier_changes with up/down counters]
> Signed-off-by: Florian Fainelli 
> ---
>  Documentation/ABI/testing/sysfs-class-net | 24 
>  include/linux/netdevice.h |  6 --
>  include/uapi/linux/if_link.h  |  2 ++
>  net/core/net-sysfs.c  | 23 ++-
>  net/core/rtnetlink.c  | 13 +++--
>  net/sched/sch_generic.c   |  4 ++--
>  6 files changed, 65 insertions(+), 7 deletions(-)
> 
> diff --git a/Documentation/ABI/testing/sysfs-class-net 
> b/Documentation/ABI/testing/sysfs-class-net
> index 6856da99b6f7..e4b0d5157305 100644
> --- a/Documentation/ABI/testing/sysfs-class-net
> +++ b/Documentation/ABI/testing/sysfs-class-net
> @@ -259,3 +259,27 @@ Contact: netdev@vger.kernel.org
>  Description:
>   Symbolic link to the PHY device this network device is attached
>   to.
> +
> +What:/sys/class/net/ +Date:Mar 2014
> +KernelVersion:   3.15
> +Contact: netdev@vger.kernel.org
> +Description:
> + 32-bit unsigned integer counting the number of times the link 
> has
> + seen a change from UP to DOWN and vice versa
> +
> +What:/sys/class/net/ +Date:Jan 2018
> +KernelVersion:   4.16
> +Contact: netdev@vger.kernel.org
> +Description:
> + 32-bit unsigned integer counting the number of times the link 
> has
> + been up
> +
> +What:/sys/class/net/ +Date:Jan 2018
> +KernelVersion:   4.16
> +Contact: netdev@vger.kernel.org
> +Description:
> + 32-bit unsigned integer counting the number of times the link 
> has
> + been down
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index ed0799a12bf2..28f68f7513d0 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1680,8 +1680,6 @@ struct net_device {
>   unsigned long   base_addr;
>   int irq;
>  
> - atomic_tcarrier_changes;
> -
>   /*
>*  Some hardware also needs these fields (state,dev_list,
>*  napi_list,unreg_list,close_list) but they are not
> @@ -1719,6 +1717,10 @@ struct net_device {
>   atomic_long_t   tx_dropped;
>   atomic_long_t   rx_nohandler;
>  
> + /* Stats to monitor link on/off, flapping */
> + atomic_tcount_link_up;
> + atomic_tcount_link_down;
> +
>  #ifdef CONFIG_WIRELESS_EXT
>   const struct iw_handler_def *wireless_handlers;
>   struct iw_public_data   *wireless_data;
> diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
> index f8f04fed6186..6e44b0674ba4 100644
> --- a/include/uapi/linux/if_link.h
> +++ b/include/uapi/linux/if_link.h
> @@ -161,6 +161,8 @@ enum {
>   IFLA_EVENT,
>   IFLA_NEW_NETNSID,
>   IFLA_IF_NETNSID,
> + IFLA_COUNT_LINK_UP,
> + IFLA_COUNT_LINK_DOWN,
>   __IFLA_MAX
>  };
>  
> diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
> index 7bf8b85ade16..9f732c3dc2ce 100644
> --- a/net/core/net-sysfs.c
> +++ b/net/core/net-sysfs.c
> @@ -295,10 +295,29 @@ static ssize_t carrier_changes_show(struct device *dev,
>   struct net_device *netdev = to_net_dev(dev);
>  
>   return sprintf(buf, fmt_dec,
> -atomic_read(&netdev->carrier_changes));
> +atomic_read(&netdev->count_link_up) +
> +atomic_read(&netdev->count_link_down));
>  }
>  static DEVICE_ATTR_RO(carrier_changes);
>  
> +static ssize_t count_link_up_show(struct device *dev,
> +   struct device_attribute *attr, char *buf)
> +{
> + struct net_device *netdev = to_net_dev(dev);
> +
> + return sprintf(buf, fmt_dec, atomic_read(&netdev->count_link_up));
> +}
> +static DEVICE_ATTR_RO(count_link_up);
> +
> +static ssize_t count_link_down_show(struct device *dev,
> + struct device_attribute *attr, char *buf)
> +{
> + struct net_device *netdev = to_net_dev(dev);
> +
> + return sprintf(buf, fmt_dec, atomic_read(&netdev->count_link_down));
> +}
> +static DEVICE_ATTR_RO(count_link_down);
> +
>  /* read-write attributes */
>  
>  static int change_mtu(struct net_device *dev, unsigned long new_mtu)
> @@ -547,6 +566,

Re: [patch net-next v11 00/13] net: sched: allow qdiscs to share filter block instances

2018-01-17 Thread Jiri Pirko
Wed, Jan 17, 2018 at 10:33:32PM CET, da...@davemloft.net wrote:
>From: Jiri Pirko 
>Date: Wed, 17 Jan 2018 21:18:07 +0100
>
>> Odd. I don't see this warning with:
>> gcc version 6.4.1 20170727 (Red Hat 6.4.1-1) (GCC)
>
>Yeah, for reference I'm using:
>
>gcc (GCC) 7.2.1 20170915 (Red Hat 7.2.1-2)

Interesting, I upgraded to the same version:
gcc version 7.2.1 20170915 (Red Hat 7.2.1-2) (GCC)

I still don't see the warning :O
what is your cmdline?


Re: [RFC PATCH net-next 02/12] selftests: forwarding: Add a test for FDB learning

2018-01-17 Thread Jiri Pirko
Wed, Jan 17, 2018 at 11:59:10PM CET, ro...@cumulusnetworks.com wrote:
>On Wed, Jan 17, 2018 at 2:46 PM, Roopa Prabhu  
>wrote:
>> On Wed, Jan 17, 2018 at 1:01 PM, Jiri Pirko  wrote:
>>> Wed, Jan 17, 2018 at 09:48:54PM CET, dsah...@gmail.com wrote:
On 1/15/18 11:18 AM, Ido Schimmel wrote:
> diff --git a/tools/testing/selftests/forwarding/lib.sh 
> b/tools/testing/selftests/forwarding/lib.sh
> index bb423371f4de..264bf0af4c4d 100644
> --- a/tools/testing/selftests/forwarding/lib.sh
> +++ b/tools/testing/selftests/forwarding/lib.sh
> @@ -22,6 +22,11 @@ if [[ ! -x "$(command -v jq)" ]]; then
>  exit 0
>  fi
>
> +if [[ ! -x "$(command -v mausezahn)" ]]; then
> +echo "SKIP: mausezahn not installed"
> +exit 0
> +fi
> +

The checks are good, but hitting a collision with this one:
$   ./router.sh
SKIP: mausezahn not installed

For debian, it is mz not mausezahn.
>>>
>>> That is weird. If you look at the sources, the binary name is
>>> "mausezahn". Looks like debian is doing some renaming :/
>>>
>>
>> I have seen/used both versions. Debian packages from some old upstream
>> source which calls it mz.
>> https://packages.debian.org/sid/main/mz
>>
>> I have used latest mz (with ipv6 support etc) from netsniff-ng which
>> builds it as mausezahn.
>
>
>actually Debian also builds
>https://packages.debian.org/sid/netsniff-ng and installs 'mausezahn'
>
>so you might have to check and use which exists... (mausezahn overriding mz).

I believe we can stick with mausezahn, since it is apparently
available on Debian too.



Re: [PATCH net] net: validate untrusted gso packets

2018-01-17 Thread Willem de Bruijn
> From: Willem de Bruijn
>
> Validate gso packet type and headers on kernel entry. Reuse the info
> gathered by skb_probe_transport_header.
>
> Syzbot found two bugs by passing bad gso packets in packet sockets.
> Untrusted user packets are limited to a small set of gso types in
> virtio_net_hdr_to_skb. But segmentation occurs on packet contents.
> Syzkaller was able to enter gso callbacks that are not hardened
> against untrusted user input.


 Do this mean there's something missed in exist header check for dodgy
 packets?
>>>
>>> virtio_net_hdr_to_skb checks gso_type, but it does not verify that this
>>> type correctly describes the actual packet. Segmentation happens based
>>> on packet contents. So a packet was crafted to enter sctp gso, even
>>> though no such gso_type exists. This issue is not specific to sctp.
>>
>>
>> So it looks to me we should do it in here in sctp_gso_segment().
>>
>> if (skb_gso_ok(skb, features | NETIF_F_GSO_ROBUST)) {
>> /* Packet is from an untrusted source, reset gso_segs. */
>
> No dodgy source can legitimately generate sctp code, so it should not
> even get there. Also, a packet can just as easily spoof an esp packet.
> See also the discussion in the Link above.
>
> We can address this specific issue in segmentation by placing a check
> analogous to skb_validate_dodgy_gso in the network layer. Something
> like this (but with ECN option support):
>
> @@ -1258,6 +1258,22 @@ struct sk_buff *inet_gso_segment(struct sk_buff *skb,
>
> skb_reset_transport_header(skb);
>
> +   gso_type = skb_shinfo(skb)->gso_type;
> +   if (gso_type & SKB_GSO_DODGY) {
> +   switch (gso_type & (SKB_GSO_TCPV4 | SKB_GSO_UDP)) {
> +   case SKB_GSO_TCPV4:
> +   if (proto != IPPROTO_TCP)
> +   goto out;
> +   break;
> +   case SKB_GSO_UDP:
> +   if (proto != IPPROTO_UDP)
> +   goto out;
> +   break;
> +   default:
> +   goto out;
> +   }
> +   }

Okay, I sent this instead: http://patchwork.ozlabs.org/patch/862643/


Re: [PATCH net-next] net: core: Expose number of link up/down transitions

2018-01-17 Thread Jakub Kicinski
On Wed, 17 Jan 2018 15:06:57 -0800, Florian Fainelli wrote:
> +What:/sys/class/net// ?


Re: [PATCH net-next] net: stmmac: Fix reception of Broadcom switches tags

2018-01-17 Thread Florian Fainelli
Hi Peppe,

On 01/16/2018 11:06 PM, Giuseppe CAVALLARO wrote:
> Hi Florian
> 
> for gmac4.x and gmac3.x series the ACS bit is the Automatic Pad or CRC
> Stripping, so the
> core strips the Pad or FCS on frames if the value of the length field is
> < 1536 bytes.
> For MAC10-100 there is the Bit 8 (ASTP) of the reg0 that does the same
> if len is < 46bytes.

OK, thanks for the detail on the older core, I will update my v2 to take
that into account.

> In your patch I can just suggest to add a new field to strip the PAD/FCS
> w/o passing the whole
> netdev struct to the core_init. In the main driver, we could manage the
> pad-strip feature (also
> by using dt) or disable it in case of netdev_uses_dsa; then propagating
> this setting to the core_init
> or calling a new callback. What do you think?

I don't think adding a DT property is a good idea, because we can
determine whether ACS is necessary or not at runtime when DSA with
proprietary Ethernet switch tags is used or not using netdev_uses_dsa().

Having the DT property to configure whether ACS is turned on or off
means that the Device Tree now describes a policy, rather than a
capability, which is counter to the Device Tree spirit. Besides, this
leaves room for making mistake, and this creates a support burden. When
interfaced with a switch, the switch is guaranteed to egress 64 bytes
(including FCS) packets towards stmmac, and the only thing we need to do
is ensure that stmmac does the same thing. This is a small penalty to
pay for small packets, and this is done only when DSA is used.

Regarding passing a net_device to the core_init functions, this was done
because dev->mtu was already accessed, and we might need to access
additional net_device members in the future, 2 means we need to find a
solution that scales, so passing the net_device around was a natural
choice. Function pointers described in common.h are not exactly
consistent in what type of arguments they take, so I did not see this as
being a big problem.

Thank you.

> 
> Regards
> Peppe
> 
> On 1/17/2018 12:25 AM, Florian Fainelli wrote:
>> Broadcom tags inserted by Broadcom switches put a 4 byte header after
>> the MAC SA and before the EtherType, which may look like some sort of 0
>> length LLC/SNAP packet (tcpdump and wireshark do think that way). With
>> ACS enabled in stmmac the packets were truncated to 8 bytes on
>> reception, whereas clearing this bit allowed normal reception to occur.
>>
>> In order to make that possible, we need to pass a net_device argument to
>> the different core_init() functions and we are dependent on the Broadcom
>> tagger padding packets correctly (which it now does). To be as little
>> invasive as possible, this is only done for gmac1000 when the network
>> device is DSA-enabled (netdev_uses_dsa() returns true).
>>
>> Signed-off-by: Florian Fainelli 
>> ---
>>   drivers/net/ethernet/stmicro/stmmac/common.h |  2 +-
>>   drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c    |  3 ++-
>>   drivers/net/ethernet/stmicro/stmmac/dwmac1000_core.c | 12 +++-
>>   drivers/net/ethernet/stmicro/stmmac/dwmac100_core.c  |  3 ++-
>>   drivers/net/ethernet/stmicro/stmmac/dwmac4_core.c    | 11 ++-
>>   drivers/net/ethernet/stmicro/stmmac/stmmac_main.c    |  2 +-
>>   6 files changed, 27 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/stmicro/stmmac/common.h
>> b/drivers/net/ethernet/stmicro/stmmac/common.h
>> index ce2ea2d491ac..2ffe76c0ff74 100644
>> --- a/drivers/net/ethernet/stmicro/stmmac/common.h
>> +++ b/drivers/net/ethernet/stmicro/stmmac/common.h
>> @@ -474,7 +474,7 @@ struct mac_device_info;
>>   /* Helpers to program the MAC core */
>>   struct stmmac_ops {
>>   /* MAC core initialization */
>> -    void (*core_init)(struct mac_device_info *hw, int mtu);
>> +    void (*core_init)(struct mac_device_info *hw, struct net_device
>> *dev);
>>   /* Enable the MAC RX/TX */
>>   void (*set_mac)(void __iomem *ioaddr, bool enable);
>>   /* Enable and verify that the IPC module is supported */
>> diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c
>> b/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c
>> index 9eb7f65d8000..a3fa65b1ca8e 100644
>> --- a/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c
>> +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c
>> @@ -483,7 +483,8 @@ static int sun8i_dwmac_init(struct platform_device
>> *pdev, void *priv)
>>   return 0;
>>   }
>>   -static void sun8i_dwmac_core_init(struct mac_device_info *hw, int mtu)
>> +static void sun8i_dwmac_core_init(struct mac_device_info *hw,
>> +  struct net_device *dev)
>>   {
>>   void __iomem *ioaddr = hw->pcsr;
>>   u32 v;
>> diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac1000_core.c
>> b/drivers/net/ethernet/stmicro/stmmac/dwmac1000_core.c
>> index 8a86340ff2d3..540d21786a43 100644
>> --- a/drivers/net/ethernet/stmicro/stmmac/dwmac1000_core.c
>> +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac10

[RFC iproute2] tc: Add support for the TBS Qdisc

2018-01-17 Thread Jesus Sanchez-Palencia
From: Vinicius Costa Gomes 

The Time Based Scheduler (TBS) queueing discipline allows precise
control of the transmission time of packets.

The syntax is:

tc qdisc add dev DEV parent NODE tbs delta 
 clockid  offload <1|0>

Signed-off-by: Vinicius Costa Gomes 
---
 include/uapi/linux/pkt_sched.h |  17 ++
 tc/Makefile|   1 +
 tc/q_tbs.c | 119 +
 3 files changed, 137 insertions(+)
 create mode 100644 tc/q_tbs.c

diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index af3cc2f4..514cb742 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -935,4 +935,21 @@ enum {
 
 #define TCA_CBS_MAX (__TCA_CBS_MAX - 1)
 
+
+/* TBS */
+struct tc_tbs_qopt {
+   __u8 offload;
+   __u8 _pad[3];
+   __s32 delta;
+   __s32 clockid;
+};
+
+enum {
+   TCA_TBS_UNSPEC,
+   TCA_TBS_PARMS,
+   __TCA_TBS_MAX,
+};
+
+#define TCA_TBS_MAX (__TCA_TBS_MAX - 1)
+
 #endif
diff --git a/tc/Makefile b/tc/Makefile
index 3716dd6a..3c87b0dc 100644
--- a/tc/Makefile
+++ b/tc/Makefile
@@ -71,6 +71,7 @@ TCMODULES += q_clsact.o
 TCMODULES += e_bpf.o
 TCMODULES += f_matchall.o
 TCMODULES += q_cbs.o
+TCMODULES += q_tbs.o
 
 TCSO :=
 ifeq ($(TC_CONFIG_ATM),y)
diff --git a/tc/q_tbs.c b/tc/q_tbs.c
new file mode 100644
index ..f4a73b58
--- /dev/null
+++ b/tc/q_tbs.c
@@ -0,0 +1,119 @@
+/*
+ * q_tbs.c TBS.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ * Authors:Vinicius Costa Gomes 
+ *
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "utils.h"
+#include "tc_util.h"
+
+static void explain(void)
+{
+   fprintf(stderr, "Usage: ... tbs delta NANOS guardband NANOS clockid 
CLOCKID [offload 0|1]\n");
+
+}
+
+static void explain1(const char *arg, const char *val)
+{
+   fprintf(stderr, "tbs: illegal value for \"%s\": \"%s\"\n", arg, val);
+}
+
+static int tbs_parse_opt(struct qdisc_util *qu, int argc,
+char **argv, struct nlmsghdr *n, const char *dev)
+{
+   struct tc_tbs_qopt opt = {};
+   struct rtattr *tail;
+
+   while (argc > 0) {
+   if (matches(*argv, "offload") == 0) {
+   NEXT_ARG();
+   if (opt.offload) {
+   fprintf(stderr, "tbs: duplicate \"offload\" 
specification\n");
+   return -1;
+   }
+   if (get_u8(&opt.offload, *argv, 0)) {
+   explain1("offload", *argv);
+   return -1;
+   }
+   } else if (matches(*argv, "delta") == 0) {
+   NEXT_ARG();
+   if (opt.delta) {
+   fprintf(stderr, "tbs: duplicate \"delta\" 
specification\n");
+   return -1;
+   }
+   if (get_s32(&opt.delta, *argv, 0)) {
+   explain1("hicredit", *argv);
+   return -1;
+   }
+   } else if (matches(*argv, "clockid") == 0) {
+   NEXT_ARG();
+   if (opt.clockid) {
+   fprintf(stderr, "tbs: duplicate \"clockid\" 
specification\n");
+   return -1;
+   }
+   if (get_s32(&opt.clockid, *argv, 0)) {
+   explain1("clockid", *argv);
+   return -1;
+   }
+   } else if (strcmp(*argv, "help") == 0) {
+   explain();
+   return -1;
+   } else {
+   fprintf(stderr, "tbs: unknown parameter \"%s\"\n", 
*argv);
+   explain();
+   return -1;
+   }
+   argc--; argv++;
+   }
+
+   tail = NLMSG_TAIL(n);
+   addattr_l(n, 1024, TCA_OPTIONS, NULL, 0);
+   addattr_l(n, 2024, TCA_TBS_PARMS, &opt, sizeof(opt));
+   tail->rta_len = (void *) NLMSG_TAIL(n) - (void *) tail;
+   return 0;
+}
+
+static int tbs_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt)
+{
+   struct rtattr *tb[TCA_TBS_MAX+1];
+   struct tc_tbs_qopt *qopt;
+
+   if (opt == NULL)
+   return 0;
+
+   parse_rtattr_nested(tb, TCA_TBS_MAX, opt);
+
+   if (tb[TCA_TBS_PARMS] == NULL)
+   return -1;
+
+   qopt = RTA_DATA(tb[TCA_TBS_PARMS]);
+   if (RTA_PAYLOAD(tb[TCA_TBS_PARMS])  < sizeof(*

[PATCH net] gso: validate gso_type if SKB_GSO_DODGY

2018-01-17 Thread Willem de Bruijn
From: Willem de Bruijn 

Validate gso_type of untrusted SKB_GSO_DODGY packets during
segmentation.

Untrusted user packets are limited to a small set of gso types in
virtio_net_hdr_to_skb. But segmentation occurs on packet contents.
Syzkaller was able to enter gso callbacks that are not hardened
against untrusted user input.

Fixes: f43798c27684 ("tun: Allow GSO using virtio_net_hdr")
Link: http://lkml.kernel.org/r/<001a1137452496ffc305617e5...@google.com>
Reported-by: syzbot+fee64147a25aecd48...@syzkaller.appspotmail.com
Signed-off-by: Willem de Bruijn 
---
 net/ipv4/af_inet.c | 21 +
 net/ipv6/ip6_offload.c | 23 ++-
 2 files changed, 43 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index f00499a46927..d5a36827f7b1 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1220,6 +1220,25 @@ int inet_sk_rebuild_header(struct sock *sk)
 }
 EXPORT_SYMBOL(inet_sk_rebuild_header);
 
+static bool inet_gso_validate_dodgy(struct sk_buff *skb, int ipproto)
+{
+   unsigned int gso_type = skb_shinfo(skb)->gso_type;
+
+   if (gso_type & SKB_GSO_DODGY) {
+   switch (gso_type & ~SKB_GSO_DODGY) {
+   case SKB_GSO_TCPV4:
+   case SKB_GSO_TCPV4 | SKB_GSO_TCP_ECN:
+   return ipproto == IPPROTO_TCP;
+   case SKB_GSO_UDP:
+   return ipproto == IPPROTO_UDP;
+   default:
+   return false;
+   }
+   }
+
+   return true;
+}
+
 struct sk_buff *inet_gso_segment(struct sk_buff *skb,
 netdev_features_t features)
 {
@@ -1245,6 +1264,8 @@ struct sk_buff *inet_gso_segment(struct sk_buff *skb,
 
id = ntohs(iph->id);
proto = iph->protocol;
+   if (!inet_gso_validate_dodgy(skb, proto))
+   goto out;
 
/* Warning: after this point, iph might be no longer valid */
if (unlikely(!pskb_may_pull(skb, ihl)))
diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
index 4a87f9428ca5..662e16548104 100644
--- a/net/ipv6/ip6_offload.c
+++ b/net/ipv6/ip6_offload.c
@@ -55,6 +55,25 @@ static int ipv6_gso_pull_exthdrs(struct sk_buff *skb, int 
proto)
return proto;
 }
 
+static bool ipv6_gso_validate_dodgy(struct sk_buff *skb, int ipproto)
+{
+   unsigned int gso_type = skb_shinfo(skb)->gso_type;
+
+   if (gso_type & SKB_GSO_DODGY) {
+   switch (gso_type & ~SKB_GSO_DODGY) {
+   case SKB_GSO_TCPV6:
+   case SKB_GSO_TCPV6 | SKB_GSO_TCP_ECN:
+   return ipproto == IPPROTO_TCP;
+   case SKB_GSO_UDP:
+   return ipproto == IPPROTO_UDP;
+   default:
+   return false;
+   }
+   }
+
+   return true;
+}
+
 static struct sk_buff *ipv6_gso_segment(struct sk_buff *skb,
netdev_features_t features)
 {
@@ -82,10 +101,12 @@ static struct sk_buff *ipv6_gso_segment(struct sk_buff 
*skb,
 
ipv6h = ipv6_hdr(skb);
__skb_pull(skb, sizeof(*ipv6h));
-   segs = ERR_PTR(-EPROTONOSUPPORT);
 
proto = ipv6_gso_pull_exthdrs(skb, ipv6h->nexthdr);
+   if (!ipv6_gso_validate_dodgy(skb, proto))
+   goto out;
 
+   segs = ERR_PTR(-EPROTONOSUPPORT);
if (skb->encapsulation &&
skb_shinfo(skb)->gso_type & (SKB_GSO_IPXIP4 | SKB_GSO_IPXIP6))
udpfrag = proto == IPPROTO_UDP && encap;
-- 
2.16.0.rc1.238.g530d649a79-goog



Re: [RFC PATCH net-next 00/12] selftests: forwarding: Add VRF-based tests

2018-01-17 Thread Andrew Lunn
> >> However, a similar kind of flexibility can be achieved by using VRFs and
> >> by looping the switch ports together. For example:
> >>
> >>  br0
> >>   +
> >>vrf-h1 |   vrf-h2
> >>  ++---+++
> >>  ||||
> >> 192.0.2.1/24 ++++ 192.0.2.2/24
> >>swp1 swp2 swp3 swp4
> >>  ++++
> >>  ||||
> >>  ++++
> >>

> Agreed this is really cool! For DSA enabled switches, we usually have a
> host that does the test sequencing and then execute commands remotely on
> the DUT, but we might be able to get such a similar framework up and
> running on the DUT itself without too much hassle.

I think the problem we will have is a lack of ports. Most DSA switches
have 4 or 5 ports. Given the need for two ports per bridge port, we
will be limited to bridges with just two members. That really limits
what sort of tests you can do.

But for top or rack switches, 16 ports, 8 loopback cables, that does
give interesting setups. If i were writing tests for that class of
routers, that would be the hardware setup i would define.

 Andrew


[RFC v2 net-next 03/10] net: ipv4: udp: Hook into time based transmission.

2018-01-17 Thread Jesus Sanchez-Palencia
From: Richard Cochran 

For udp packets, copy the desired future transmit time from the CMSG
cookie into the skb.

Signed-off-by: Richard Cochran 
Signed-off-by: Jesus Sanchez-Palencia 
---
 net/ipv4/udp.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 853321555a4e..89257422503a 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -926,6 +926,7 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t 
len)
}
 
ipc.sockc.tsflags = sk->sk_tsflags;
+   ipc.sockc.transmit_time = 0;
ipc.addr = inet->inet_saddr;
ipc.oif = sk->sk_bound_dev_if;
 
@@ -1027,8 +1028,10 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t len)
  sizeof(struct udphdr), &ipc, &rt,
  msg->msg_flags);
err = PTR_ERR(skb);
-   if (!IS_ERR_OR_NULL(skb))
+   if (!IS_ERR_OR_NULL(skb)) {
+   skb->tstamp = ipc.sockc.transmit_time;
err = udp_send_skb(skb, fl4);
+   }
goto out;
}
 
-- 
2.15.1



[RFC v2 net-next 05/10] net/sched: Allow creating a Qdisc watchdog with other clocks

2018-01-17 Thread Jesus Sanchez-Palencia
From: Vinicius Costa Gomes 

This adds 'qdisc_watchdog_init_clockid()' that allows a clockid to be
passed, this allows other time references to be used when scheduling
the Qdisc to run.

Signed-off-by: Vinicius Costa Gomes 
---
 include/net/pkt_sched.h |  2 ++
 net/sched/sch_api.c | 11 +--
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
index 815b92a23936..2466ea143d01 100644
--- a/include/net/pkt_sched.h
+++ b/include/net/pkt_sched.h
@@ -72,6 +72,8 @@ struct qdisc_watchdog {
struct Qdisc*qdisc;
 };
 
+void qdisc_watchdog_init_clockid(struct qdisc_watchdog *wd, struct Qdisc 
*qdisc,
+clockid_t clockid);
 void qdisc_watchdog_init(struct qdisc_watchdog *wd, struct Qdisc *qdisc);
 void qdisc_watchdog_schedule_ns(struct qdisc_watchdog *wd, u64 expires);
 
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index d512f49ee83c..d6136920b5c1 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -596,12 +596,19 @@ static enum hrtimer_restart qdisc_watchdog(struct hrtimer 
*timer)
return HRTIMER_NORESTART;
 }
 
-void qdisc_watchdog_init(struct qdisc_watchdog *wd, struct Qdisc *qdisc)
+void qdisc_watchdog_init_clockid(struct qdisc_watchdog *wd, struct Qdisc 
*qdisc,
+clockid_t clockid)
 {
-   hrtimer_init(&wd->timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
+   hrtimer_init(&wd->timer, clockid, HRTIMER_MODE_ABS_PINNED);
wd->timer.function = qdisc_watchdog;
wd->qdisc = qdisc;
 }
+EXPORT_SYMBOL(qdisc_watchdog_init_clockid);
+
+void qdisc_watchdog_init(struct qdisc_watchdog *wd, struct Qdisc *qdisc)
+{
+   qdisc_watchdog_init_clockid(wd, qdisc, CLOCK_MONOTONIC);
+}
 EXPORT_SYMBOL(qdisc_watchdog_init);
 
 void qdisc_watchdog_schedule_ns(struct qdisc_watchdog *wd, u64 expires)
-- 
2.15.1



[RFC v2 net-next 02/10] net: ipv4: raw: Hook into time based transmission.

2018-01-17 Thread Jesus Sanchez-Palencia
From: Richard Cochran 

For raw packets, copy the desired future transmit time from the CMSG
cookie into the skb.

Signed-off-by: Richard Cochran 
Signed-off-by: Jesus Sanchez-Palencia 
---
 net/ipv4/raw.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 136544b36a46..e37ea8ab6a78 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -381,6 +381,7 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 
*fl4,
 
skb->priority = sk->sk_priority;
skb->mark = sk->sk_mark;
+   skb->tstamp = sockc->transmit_time;
skb_dst_set(skb, &rt->dst);
*rtp = NULL;
 
@@ -562,6 +563,7 @@ static int raw_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t len)
}
 
ipc.sockc.tsflags = sk->sk_tsflags;
+   ipc.sockc.transmit_time = 0;
ipc.addr = inet->inet_saddr;
ipc.opt = NULL;
ipc.tx_flags = 0;
-- 
2.15.1



[PATCH net-next] net: core: Expose number of link up/down transitions

2018-01-17 Thread Florian Fainelli
From: David Decotigny 

Expose the number of times the link has been going UP or DOWN, and
update the "carrier_changes" counter to be the sum of these two events.
While at it, also update the sysfs-class-net documentation to cover:
carrier_changes (3.15), count_link_up (4.16) and count_link_down (4.16)

Signed-off-by: David Decotigny 
[Florian:
* rebase
* add documentation
* merge carrier_changes with up/down counters]
Signed-off-by: Florian Fainelli 
---
 Documentation/ABI/testing/sysfs-class-net | 24 
 include/linux/netdevice.h |  6 --
 include/uapi/linux/if_link.h  |  2 ++
 net/core/net-sysfs.c  | 23 ++-
 net/core/rtnetlink.c  | 13 +++--
 net/sched/sch_generic.c   |  4 ++--
 6 files changed, 65 insertions(+), 7 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-class-net 
b/Documentation/ABI/testing/sysfs-class-net
index 6856da99b6f7..e4b0d5157305 100644
--- a/Documentation/ABI/testing/sysfs-class-net
+++ b/Documentation/ABI/testing/sysfs-class-net
@@ -259,3 +259,27 @@ Contact:   netdev@vger.kernel.org
 Description:
Symbolic link to the PHY device this network device is attached
to.
+
+What:  /sys/class/net/carrier_changes));
+  atomic_read(&netdev->count_link_up) +
+  atomic_read(&netdev->count_link_down));
 }
 static DEVICE_ATTR_RO(carrier_changes);
 
+static ssize_t count_link_up_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+   struct net_device *netdev = to_net_dev(dev);
+
+   return sprintf(buf, fmt_dec, atomic_read(&netdev->count_link_up));
+}
+static DEVICE_ATTR_RO(count_link_up);
+
+static ssize_t count_link_down_show(struct device *dev,
+   struct device_attribute *attr, char *buf)
+{
+   struct net_device *netdev = to_net_dev(dev);
+
+   return sprintf(buf, fmt_dec, atomic_read(&netdev->count_link_down));
+}
+static DEVICE_ATTR_RO(count_link_down);
+
 /* read-write attributes */
 
 static int change_mtu(struct net_device *dev, unsigned long new_mtu)
@@ -547,6 +566,8 @@ static struct attribute *net_class_attrs[] __ro_after_init 
= {
&dev_attr_phys_port_name.attr,
&dev_attr_phys_switch_id.attr,
&dev_attr_proto_down.attr,
+   &dev_attr_count_link_up.attr,
+   &dev_attr_count_link_down.attr,
NULL,
 };
 ATTRIBUTE_GROUPS(net_class);
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 16d644a4f974..68f69a956713 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -990,6 +990,8 @@ static noinline size_t if_nlmsg_size(const struct 
net_device *dev,
   + nla_total_size(4)  /* IFLA_NEW_NETNSID */
   + nla_total_size(1)  /* IFLA_PROTO_DOWN */
   + nla_total_size(4)  /* IFLA_IF_NETNSID */
+  + nla_total_size(4)  /* IFLA_COUNT_LINK_UP */
+  + nla_total_size(4)  /* IFLA_COUNT_LINK_DOWN */
   + 0;
 }
 
@@ -1551,8 +1553,13 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb,
 nla_put_string(skb, IFLA_QDISC, dev->qdisc->ops->id)) ||
nla_put_ifalias(skb, dev) ||
nla_put_u32(skb, IFLA_CARRIER_CHANGES,
-   atomic_read(&dev->carrier_changes)) ||
-   nla_put_u8(skb, IFLA_PROTO_DOWN, dev->proto_down))
+   atomic_read(&dev->count_link_up) +
+   atomic_read(&dev->count_link_down)) ||
+   nla_put_u8(skb, IFLA_PROTO_DOWN, dev->proto_down) ||
+   nla_put_u32(skb, IFLA_COUNT_LINK_UP,
+   atomic_read(&dev->count_link_up)) ||
+   nla_put_u32(skb, IFLA_COUNT_LINK_DOWN,
+   atomic_read(&dev->count_link_down)))
goto nla_put_failure;
 
if (event != IFLA_EVENT_NONE) {
@@ -1656,6 +1663,8 @@ static const struct nla_policy ifla_policy[IFLA_MAX+1] = {
[IFLA_EVENT]= { .type = NLA_U32 },
[IFLA_GROUP]= { .type = NLA_U32 },
[IFLA_IF_NETNSID]   = { .type = NLA_S32 },
+   [IFLA_COUNT_LINK_UP]= { .type = NLA_U32 },
+   [IFLA_COUNT_LINK_DOWN]  = { .type = NLA_U32 },
 };
 
 static const struct nla_policy ifla_info_policy[IFLA_INFO_MAX+1] = {
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index ef8b4ecde2ac..28941636afa3 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -510,7 +510,7 @@ void netif_carrier_on(struct net_device *dev)
if (test_and_clear_bit(__LINK_STATE_NOCARRIER, &dev->state)) {
if (dev->reg_state == NETREG_UNINITIALIZED)
return;
-   atomic_inc(&dev->carrier_changes);
+   atomic_inc(&dev->count_link_up);
linkwatch_fire_event(dev);
if (netif_running(dev))
__n

[RFC v2 net-next 06/10] net/sched: Introduce the TBS Qdisc

2018-01-17 Thread Jesus Sanchez-Palencia
From: Vinicius Costa Gomes 

TBS (Time Based Scheduler) uses the information added earlier in this
series (the socket option SO_TXTIME and the new role of
sk_buff->tstamp) to schedule traffic transmission based on absolute
time.

For some workloads, just bandwidth enforcement is not enough, and
precise control of the transmission of packets is necessary.

Example:

$ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
   map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0

$ tc qdisc add dev enp2s0 parent 100:1 tbs delta 6 clockid 11 offload 1

In this example, the Qdisc will try to enable offloading (offload 1)
the control of the transmission time to the network adapter, the
time stamp in socket are in reference to the clockid '11' (CLOCK_TAI)
and packets leave the Qdisc "delta" (6) nanoseconds before its
transmission time.

When offloading is disabled, the network adapter will ignore the
sk_buff time stamp, and so, the transmission time will be only "best
effort" from the Qdisc.

Signed-off-by: Vinicius Costa Gomes 
Signed-off-by: Jesus Sanchez-Palencia 
---
 include/linux/netdevice.h  |   1 +
 include/net/pkt_sched.h|   5 +
 include/uapi/linux/pkt_sched.h |  17 ++
 net/sched/Kconfig  |  11 ++
 net/sched/Makefile |   1 +
 net/sched/sch_tbs.c| 392 +
 6 files changed, 427 insertions(+)
 create mode 100644 net/sched/sch_tbs.c

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index ed0799a12bf2..e87031bd108e 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -781,6 +781,7 @@ enum tc_setup_type {
TC_SETUP_QDISC_CBS,
TC_SETUP_QDISC_RED,
TC_SETUP_QDISC_PRIO,
+   TC_SETUP_QDISC_TBS,
 };
 
 /* These structures hold the attributes of bpf state that are being passed
diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
index 2466ea143d01..d042ffda7f21 100644
--- a/include/net/pkt_sched.h
+++ b/include/net/pkt_sched.h
@@ -155,4 +155,9 @@ struct tc_cbs_qopt_offload {
s32 sendslope;
 };
 
+struct tc_tbs_qopt_offload {
+   u8 enable;
+   s32 queue;
+};
+
 #endif
diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index 37b5096ae97b..6bb39944ba32 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -934,4 +934,21 @@ enum {
 
 #define TCA_CBS_MAX (__TCA_CBS_MAX - 1)
 
+
+/* TBS */
+struct tc_tbs_qopt {
+   __u8 offload;
+   __u8 _pad[3];
+   __s32 delta;
+   __s32 clockid;
+};
+
+enum {
+   TCA_TBS_UNSPEC,
+   TCA_TBS_PARMS,
+   __TCA_TBS_MAX,
+};
+
+#define TCA_TBS_MAX (__TCA_TBS_MAX - 1)
+
 #endif
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index c03d86a7775e..7d54045995a3 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -183,6 +183,17 @@ config NET_SCH_CBS
  To compile this code as a module, choose M here: the
  module will be called sch_cbs.
 
+config NET_SCH_TBS
+   tristate "Time Based Scheduler (TBS)"
+   ---help---
+ Say Y here if you want to use the Time Based Scheduler (TBS) packet
+ scheduling algorithm.
+
+ See the top of  for more details.
+
+ To compile this code as a module, choose M here: the
+ module will be called sch_tbs.
+
 config NET_SCH_GRED
tristate "Generic Random Early Detection (GRED)"
---help---
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 5b635447e3f8..0f7f29505c89 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -54,6 +54,7 @@ obj-$(CONFIG_NET_SCH_FQ)  += sch_fq.o
 obj-$(CONFIG_NET_SCH_HHF)  += sch_hhf.o
 obj-$(CONFIG_NET_SCH_PIE)  += sch_pie.o
 obj-$(CONFIG_NET_SCH_CBS)  += sch_cbs.o
+obj-$(CONFIG_NET_SCH_TBS)  += sch_tbs.o
 
 obj-$(CONFIG_NET_CLS_U32)  += cls_u32.o
 obj-$(CONFIG_NET_CLS_ROUTE4)   += cls_route.o
diff --git a/net/sched/sch_tbs.c b/net/sched/sch_tbs.c
new file mode 100644
index ..300456063ac9
--- /dev/null
+++ b/net/sched/sch_tbs.c
@@ -0,0 +1,392 @@
+/*
+ * net/sched/sch_tbs.c Time Based Shaper
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ * Authors:Vinicius Costa Gomes 
+ * Jesus Sanchez-Palencia 
+ *
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+struct tbs_sched_data {
+   bool offload;
+   int clockid;
+   int queue;
+   s32 delta; /* in ns */
+   ktime_t last; /* The txtime of the last skb sent to the netdevice. */
+   struct rb_root head;
+   struct qdisc_watchdog watchdog;
+   struct Qdisc *qdisc;
+};
+
+static 

[RFC v2 net-next 04/10] net: packet: Hook into time based transmission.

2018-01-17 Thread Jesus Sanchez-Palencia
From: Richard Cochran 

For raw layer-2 packets, copy the desired future transmit time from
the CMSG cookie into the skb.

Signed-off-by: Richard Cochran 
Signed-off-by: Jesus Sanchez-Palencia 
---
 net/packet/af_packet.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 05d31864a34e..6af129a65e85 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -1976,6 +1976,7 @@ static int packet_sendmsg_spkt(struct socket *sock, 
struct msghdr *msg,
goto out_unlock;
}
 
+   sockc.transmit_time = 0;
sockc.tsflags = sk->sk_tsflags;
if (msg->msg_controllen) {
err = sock_cmsg_send(sk, msg, &sockc);
@@ -1987,6 +1988,7 @@ static int packet_sendmsg_spkt(struct socket *sock, 
struct msghdr *msg,
skb->dev = dev;
skb->priority = sk->sk_priority;
skb->mark = sk->sk_mark;
+   skb->tstamp = sockc.transmit_time;
 
sock_tx_timestamp(sk, sockc.tsflags, &skb_shinfo(skb)->tx_flags);
 
@@ -2484,6 +2486,7 @@ static int tpacket_fill_skb(struct packet_sock *po, 
struct sk_buff *skb,
skb->dev = dev;
skb->priority = po->sk.sk_priority;
skb->mark = po->sk.sk_mark;
+   skb->tstamp = sockc->transmit_time;
sock_tx_timestamp(&po->sk, sockc->tsflags, &skb_shinfo(skb)->tx_flags);
skb_shinfo(skb)->destructor_arg = ph.raw;
 
@@ -2660,6 +2663,7 @@ static int tpacket_snd(struct packet_sock *po, struct 
msghdr *msg)
if (unlikely(!(dev->flags & IFF_UP)))
goto out_put;
 
+   sockc.transmit_time = 0;
sockc.tsflags = po->sk.sk_tsflags;
if (msg->msg_controllen) {
err = sock_cmsg_send(&po->sk, msg, &sockc);
@@ -2856,6 +2860,7 @@ static int packet_snd(struct socket *sock, struct msghdr 
*msg, size_t len)
if (unlikely(!(dev->flags & IFF_UP)))
goto out_unlock;
 
+   sockc.transmit_time = 0;
sockc.tsflags = sk->sk_tsflags;
sockc.mark = sk->sk_mark;
if (msg->msg_controllen) {
@@ -2928,6 +2933,7 @@ static int packet_snd(struct socket *sock, struct msghdr 
*msg, size_t len)
skb->dev = dev;
skb->priority = sk->sk_priority;
skb->mark = sockc.mark;
+   skb->tstamp = sockc.transmit_time;
 
if (has_vnet_hdr) {
err = virtio_net_hdr_to_skb(skb, &vnet_hdr, vio_le());
-- 
2.15.1



  1   2   3   4   >