Re: [PATCH net-next] net: sock_rps_record_flow() is for connected sockets

2016-12-06 Thread Paolo Abeni
On Tue, 2016-12-06 at 19:32 -0800, Eric Dumazet wrote:
> From: Eric Dumazet 
> 
> Paolo noticed a cache line miss in UDP recvmsg() to access
> sk_rxhash, sharing a cache line with sk_drops.
> 
> sk_drops might be heavily incremented by cpus handling a flood targeting
> this socket.
> 
> We might place sk_drops on a separate cache line, but let's try
> to avoid wasting 64 bytes per socket just for this, since we have
> other bottlenecks to take care of.
> 
> sock_rps_record_flow() should only access sk_rxhash for connected
> flows.
> 
> Testing sk_state for TCP_ESTABLISHED covers most of the cases for
> connected sockets, for a zero cost, since system calls using
> sock_rps_record_flow() also access sk->sk_prot which is on the
> same cache line.
> 
> A follow up patch will provide a static_key (Jump Label) since most
> hosts do not even use RFS.
> 
> Signed-off-by: Eric Dumazet 
> Reported-by: Paolo Abeni 
> ---
>  include/net/sock.h |   12 +++-
>  1 file changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 
> 6dfe3aa22b970eecfab4d4a0753804b1cc82a200..a7ddab993b496f1f4060f0b41831a161c284df9e
>  100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -913,7 +913,17 @@ static inline void sock_rps_record_flow_hash(__u32 hash)
>  static inline void sock_rps_record_flow(const struct sock *sk)
>  {
>  #ifdef CONFIG_RPS
> - sock_rps_record_flow_hash(sk->sk_rxhash);
> + /* Reading sk->sk_rxhash might incur an expensive cache line miss.
> +  *
> +  * TCP_ESTABLISHED does cover almost all states where RFS
> +  * might be useful, and is cheaper [1] than testing :
> +  *  IPv4: inet_sk(sk)->inet_daddr
> +  *  IPv6: ipv6_addr_any(&sk->sk_v6_daddr)
> +  * OR   an additional socket flag
> +  * [1] : sk_state and sk_prot are in the same cache line.
> +  */
> + if (sk->sk_state == TCP_ESTABLISHED)
> + sock_rps_record_flow_hash(sk->sk_rxhash);
>  #endif
>  }

Thank you for the very prompt patch!

You made me curious about your other idea on this topic; is this what you
initially talked about?

LGTM.

Acked-by: Paolo Abeni 



Re: [PATCH net-next] net: sock_rps_record_flow() is for connected sockets

2016-12-06 Thread Paolo Abeni
On Tue, 2016-12-06 at 22:47 -0800, Eric Dumazet wrote:
> On Tue, 2016-12-06 at 19:32 -0800, Eric Dumazet wrote:
> > A follow up patch will provide a static_key (Jump Label) since most
> > hosts do not even use RFS.
> 
> Speaking of static_key, it appears we now have GRO on UDP, and this
> consumes a considerable amount of cpu cycles.
> 
> Turning off GRO allows me to get +20 % more packets on my single UDP
> socket. (1.2 Mpps instead of 1.0 Mpps)

I also see an improvement for single flow tests when disabling GRO, but on
a smaller scale (~5%, if I recall correctly).

> Surely udp_gro_receive() should be bypassed if no UDP socket has
> registered a udp_sk(sk)->gro_receive handler.
> 
> And/or delay the inet_add_offload(&udpv{4|6}_offload, IPPROTO_UDP); to
> the first UDP socket setting a udp_sk(sk)->gro_receive handler,
> ie udp_encap_enable() and udpv6_encap_enable()

I had some patches adding explicit static keys for udp_gro_receive, but
they were ugly and I did not get that much gain (I measured ~1-2%
skipping udp_gro_receive only). I can try to refresh them anyway.

We have some experimental patches implementing GRO for plain UDP
connected sockets, using frag_list to preserve the individual skb lengths,
and delivering the packets to user space individually. With that I got
~3 Mpps with a single queue/user space sink - before the recent UDP
improvements. I would like to present these patches on netdev soon (no
sooner than next week, anyway).

Cheers,

Paolo



[PATCH 1/1] ixgbe: fcoe: return value of skb_linearize should be handled

2016-12-06 Thread Zhouyi Zhou
Signed-off-by: Zhouyi Zhou 
Reviewed-by: Cong Wang 
Reviewed-by: Yuval Shaia  
Reviewed-by: Eric Dumazet 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.c | 6 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 3 +--
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.c
index 2a653ec..7b6bdb7 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.c
@@ -490,7 +490,11 @@ int ixgbe_fcoe_ddp(struct ixgbe_adapter *adapter,
 */
if ((fh->fh_r_ctl == FC_RCTL_DD_SOL_DATA) &&
(fctl & FC_FC_END_SEQ)) {
-   skb_linearize(skb);
+   int err;
+
+   err = skb_linearize(skb);
+   if (err)
+   return err;
crc = (struct fcoe_crc_eof *)skb_put(skb, sizeof(*crc));
crc->fcoe_eof = FC_EOF_T;
}
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index fee1f29..4926d48 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -2173,8 +2173,7 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector 
*q_vector,
total_rx_bytes += ddp_bytes;
total_rx_packets += DIV_ROUND_UP(ddp_bytes,
 mss);
-   }
-   if (!ddp_bytes) {
+   } else {
dev_kfree_skb_any(skb);
continue;
}
-- 
1.9.1



[net-next 00/20][pull request] 40GbE Intel Wired LAN Driver Updates 2016-12-06

2016-12-06 Thread Jeff Kirsher
This series contains updates to i40e and i40evf only.

Filip modifies the i40e to log link speed change and when the link is
brought up and down.

Mitch replaces i40e_txd_use_count() with a new function which is slightly
faster and better documented so the dim-witted can better follow the
code.  He also fixes the locking of the service task so that it is
actually done in the service task and not in the scheduling function
which calls it.

Jacob, being the busy little beaver he is, provides most of the changes,
starting with restoring a workaround that is still needed in some
configurations, specifically the Ethernet Controller XL710 for 40GbE
QSFP+.  He removes duplicate code and simplifies the i40e_vsi_add_vlan()
and i40e_vsi_kill_vlan() functions, and removes detection of PTP frames
over L4 (UDP) on the XL710 MAC, since there was a product decision to
defeature it.  He also fixes a previous refactor of active filters which
caused issues in the accounting of active_filters.  The remaining work
was done in the VLAN filters to improve readability and simplify the
code as much as possible to reduce inconsistencies.

Alex fixes budget accounting being fouled in core code by making
napi_poll return the actual work done, capped to budget - 1.

Henry fixes the "ethtool -p" function for 1G BaseT PHYs.

Carolyn adds support for 25G devices for i40e and i40evf.

Michal adds functions to apply the correct access method for external PHYs
which could use Clause22 or Clause45 depending on the PHY.

The following are changes since commit d4aea20d889e05575bb331a3dadf176176f7d631:
  tun: Use netif_receive_skb instead of netif_rx
and are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue 40GbE

Alexander Duyck (1):
  i40e/i40evf: napi_poll must return the work done

Bimmy Pujari (1):
  Changed version from 1.6.21 to 1.6.25

Carolyn Wyborny (2):
  i40e: Add support for 25G devices
  i40e: Add FEC for 25g

Filip Sadowski (1):
  i40e: Driver prints log message on link speed change

Henry Tieman (1):
  i40e: Blink LED on 1G BaseT boards

Jacob Keller (11):
  i40e: restore workaround for removing default MAC filter
  i40e: remove code to handle dev_addr specially
  i40e: use unsigned printf format specifier for active_filters count
  i40e: defeature support for PTP L4 frame detection on XL710
  i40e: recalculate vsi->active_filters from hash contents
  i40e: refactor i40e_update_filter_state to avoid passing aq_err
  i40e: delete filter after adding its replacement when converting
  i40e: factor out addition/deletion of VLAN per each MAC address
  i40e: use (add|rm)_vlan_all_mac helper functions when changing PVID
  i40e: move all updates for VLAN mode into i40e_sync_vsi_filters
  i40e: don't allow i40e_vsi_(add|kill)_vlan to operate when VID<1

Michal Kosiarz (1):
  i40e: Add functions which apply correct PHY access method for read and
write operation

Mitch Williams (2):
  i40e: simplify txd use count calculation
  i40e: lock service task correctly

 drivers/net/ethernet/intel/i40e/i40e.h |  10 +-
 drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h  |  51 ++-
 drivers/net/ethernet/intel/i40e/i40e_common.c  |  85 +++-
 drivers/net/ethernet/intel/i40e/i40e_debugfs.c |   2 +-
 drivers/net/ethernet/intel/i40e/i40e_devids.h  |   2 +
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c |  51 ++-
 drivers/net/ethernet/intel/i40e/i40e_main.c| 491 -
 drivers/net/ethernet/intel/i40e/i40e_prototype.h   |   4 +
 drivers/net/ethernet/intel/i40e/i40e_ptp.c |  21 +-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c|   2 +-
 drivers/net/ethernet/intel/i40e/i40e_txrx.h|  45 +-
 drivers/net/ethernet/intel/i40e/i40e_type.h|  82 ++--
 drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c |  46 +-
 .../net/ethernet/intel/i40evf/i40e_adminq_cmd.h|  51 ++-
 drivers/net/ethernet/intel/i40evf/i40e_common.c|   2 +
 drivers/net/ethernet/intel/i40evf/i40e_devids.h|   2 +
 drivers/net/ethernet/intel/i40evf/i40e_prototype.h |   4 +
 drivers/net/ethernet/intel/i40evf/i40e_txrx.c  |   2 +-
 drivers/net/ethernet/intel/i40evf/i40e_txrx.h  |  45 +-
 drivers/net/ethernet/intel/i40evf/i40e_type.h  |  82 ++--
 drivers/net/ethernet/intel/i40evf/i40evf_ethtool.c |   8 +
 drivers/net/ethernet/intel/i40evf/i40evf_main.c|   2 +-
 .../net/ethernet/intel/i40evf/i40evf_virtchnl.c|   3 +
 23 files changed, 731 insertions(+), 362 deletions(-)

-- 
2.9.3



[net-next 17/20] i40e: factor out addition/deletion of VLAN per each MAC address

2016-12-06 Thread Jeff Kirsher
From: Jacob Keller 

A future refactor of how the PF assigns a PVID to a VF will want to be
able to add and remove a block of filters by VLAN without worrying about
accidentally triggering the accounting for I40E_VLAN_ANY. Additionally
the PVID assignment would like to be able to batch several changes under
one use of the mac_filter_hash_lock.

Factor out the addition and deletion of a VLAN on all MACs into their
own function which i40e_vsi_(add|kill)_vlan can use. These new functions
expect the caller to take the hash lock, as well as perform any
necessary accounting for updating I40E_VLAN_ANY filters if we are now
operating under VLAN mode.

Change-ID: If79e5b60b770433275350a74b3f1880333a185d5
Signed-off-by: Jacob Keller 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 68 +++--
 1 file changed, 55 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index f9e9c90..8aedfb7 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -2493,19 +2493,24 @@ static void i40e_vlan_rx_register(struct net_device 
*netdev, u32 features)
 }
 
 /**
- * i40e_vsi_add_vlan - Add vsi membership for given vlan
+ * i40e_add_vlan_all_mac - Add a MAC/VLAN filter for each existing MAC address
  * @vsi: the vsi being configured
  * @vid: vlan id to be added (0 = untagged only , -1 = any)
+ *
+ * This is a helper function for adding a new MAC/VLAN filter with the
+ * specified VLAN for each existing MAC address already in the hash table.
+ * This function does *not* perform any accounting to update filters based on
+ * VLAN mode.
+ *
+ * NOTE: this function expects to be called while under the
+ * mac_filter_hash_lock
  **/
-int i40e_vsi_add_vlan(struct i40e_vsi *vsi, s16 vid)
+static int i40e_add_vlan_all_mac(struct i40e_vsi *vsi, s16 vid)
 {
-   struct i40e_mac_filter *f, *add_f, *del_f;
+   struct i40e_mac_filter *f, *add_f;
struct hlist_node *h;
int bkt;
 
-   /* Locked once because all functions invoked below iterates list*/
-   spin_lock_bh(&vsi->mac_filter_hash_lock);
-
hash_for_each_safe(vsi->mac_filter_hash, bkt, h, f, hlist) {
if (f->state == I40E_FILTER_REMOVE)
continue;
@@ -2514,11 +2519,33 @@ int i40e_vsi_add_vlan(struct i40e_vsi *vsi, s16 vid)
dev_info(&vsi->back->pdev->dev,
 "Could not add vlan filter %d for %pM\n",
 vid, f->macaddr);
-   spin_unlock_bh(&vsi->mac_filter_hash_lock);
return -ENOMEM;
}
}
 
+   return 0;
+}
+
+/**
+ * i40e_vsi_add_vlan - Add VSI membership for given VLAN
+ * @vsi: the VSI being configured
+ * @vid: VLAN id to be added (0 = untagged only , -1 = any)
+ **/
+int i40e_vsi_add_vlan(struct i40e_vsi *vsi, s16 vid)
+{
+   struct i40e_mac_filter *f, *add_f, *del_f;
+   struct hlist_node *h;
+   int bkt, err;
+
+   /* Locked once because all functions invoked below iterates list*/
+   spin_lock_bh(&vsi->mac_filter_hash_lock);
+
+   err = i40e_add_vlan_all_mac(vsi, vid);
+   if (err) {
+   spin_unlock_bh(&vsi->mac_filter_hash_lock);
+   return err;
+   }
+
/* When we add a new VLAN filter, we need to make sure that all existing
 * filters which are marked as vid=-1 (I40E_VLAN_ANY) are converted to
 * vid=0. The simplest way is just search for all filters marked as
@@ -2557,24 +2584,39 @@ int i40e_vsi_add_vlan(struct i40e_vsi *vsi, s16 vid)
 }
 
 /**
- * i40e_vsi_kill_vlan - Remove vsi membership for given vlan
+ * i40e_rm_vlan_all_mac - Remove MAC/VLAN pair for all MAC with the given VLAN
  * @vsi: the vsi being configured
  * @vid: vlan id to be removed (0 = untagged only , -1 = any)
- **/
-void i40e_vsi_kill_vlan(struct i40e_vsi *vsi, s16 vid)
+ *
+ * This function should be used to remove all VLAN filters which match the
+ * given VID. It does not schedule the service event and does not take the
+ * mac_filter_hash_lock so it may be combined with other operations under
+ * a single invocation of the mac_filter_hash_lock.
+ *
+ * NOTE: this function expects to be called while under the
+ * mac_filter_hash_lock
+ */
+static void i40e_rm_vlan_all_mac(struct i40e_vsi *vsi, s16 vid)
 {
struct i40e_mac_filter *f;
struct hlist_node *h;
int bkt;
 
-   /* Locked once because all functions invoked below iterates list */
-   spin_lock_bh(&vsi->mac_filter_hash_lock);
-
hash_for_each_safe(vsi->mac_filter_hash, bkt, h, f, hlist) {
if (f->vlan == vid)
__i40e_del_filter(vsi, f);
}
+}
 
+/**
+ * i40e_vsi_kill_vlan - 

[net-next 08/20] i40e: use unsigned printf format specifier for active_filters count

2016-12-06 Thread Jeff Kirsher
From: Jacob Keller 

Replace the %d specifier used for printing vsi->active_filters and
vsi->promisc_threshold with an unsigned %u format specifier. While it is
unlikely in practice that these values will ever reach such a large
number they are unsigned values and thus should not be interpreted as
negative numbers.

Change-ID: Iff050fad5a1c8537c4c57fcd527441cd95cfc0d4
Signed-off-by: Jacob Keller 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_debugfs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_debugfs.c 
b/drivers/net/ethernet/intel/i40e/i40e_debugfs.c
index b8a03a0..f1f41f1 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_debugfs.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_debugfs.c
@@ -172,7 +172,7 @@ static void i40e_dbg_dump_vsi_seid(struct i40e_pf *pf, int 
seid)
 f->macaddr, f->vlan,
 i40e_filter_state_string[f->state]);
}
-   dev_info(&pf->pdev->dev, "active_filters %d, promisc_threshold %d, overflow promisc %s\n",
+   dev_info(&pf->pdev->dev, "active_filters %u, promisc_threshold %u, overflow promisc %s\n",
 vsi->active_filters, vsi->promisc_threshold,
 (test_bit(__I40E_FILTER_OVERFLOW_PROMISC, &vsi->state) ?
  "ON" : "OFF"));
-- 
2.9.3



[net-next 07/20] Changed version from 1.6.21 to 1.6.25

2016-12-06 Thread Jeff Kirsher
From: Bimmy Pujari 

Signed-off-by: Bimmy Pujari 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 2 +-
 drivers/net/ethernet/intel/i40evf/i40evf_main.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index dbb854b..aecf63b 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -41,7 +41,7 @@ static const char i40e_driver_string[] =
 
 #define DRV_VERSION_MAJOR 1
 #define DRV_VERSION_MINOR 6
-#define DRV_VERSION_BUILD 21
+#define DRV_VERSION_BUILD 25
 #define DRV_VERSION __stringify(DRV_VERSION_MAJOR) "." \
 __stringify(DRV_VERSION_MINOR) "." \
 __stringify(DRV_VERSION_BUILD)DRV_KERN
diff --git a/drivers/net/ethernet/intel/i40evf/i40evf_main.c 
b/drivers/net/ethernet/intel/i40evf/i40evf_main.c
index ca85021..c0fc533 100644
--- a/drivers/net/ethernet/intel/i40evf/i40evf_main.c
+++ b/drivers/net/ethernet/intel/i40evf/i40evf_main.c
@@ -38,7 +38,7 @@ static const char i40evf_driver_string[] =
 
 #define DRV_VERSION_MAJOR 1
 #define DRV_VERSION_MINOR 6
-#define DRV_VERSION_BUILD 21
+#define DRV_VERSION_BUILD 25
 #define DRV_VERSION __stringify(DRV_VERSION_MAJOR) "." \
 __stringify(DRV_VERSION_MINOR) "." \
 __stringify(DRV_VERSION_BUILD) \
-- 
2.9.3



[net-next 03/20] i40e: restore workaround for removing default MAC filter

2016-12-06 Thread Jeff Kirsher
From: Jacob Keller 

A previous commit 53cb6e9e8949 ("i40e: Removal of workaround for simple
MAC address filter deletion") removed a workaround for some
firmware versions which was reported to not be necessary in production
NICs. Unfortunately this workaround is necessary in some configurations,
specifically the Ethernet Controller XL710 for 40GbE QSFP+ (8086:1583).

Without this patch, the mentioned NICs with current firmware exhibit
issues when adding VLANs, as outlined by the following reproduction:

  $modprobe i40e
  $ip link set  up
  $ip link add link  vlan100 type vlan id 100
  $dmesg | tail
  
  kernel: i40e :82:00.0: Error I40E_AQ_RC_EINVAL adding RX
filters on PF, promiscuous mode forced on

This results in filters being marked as FAILED and setting the device in
promiscuous mode.

The root cause of receiving the -EINVAL error response appears to be due
to a conflict with the default MAC filter which still exists on the
default firmware for this device. Attempting to add a new VLAN filter on
the default MAC address conflicts with the IGNORE_VLAN setting on the
default rule.

Change-ID: I4d8f6d48ac5f60cfe981b3baad30eb4d7c170d61
Signed-off-by: Jacob Keller 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 41 +
 1 file changed, 41 insertions(+)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 4534d41..c467cc4 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -1226,6 +1226,39 @@ bool i40e_is_vsi_in_vlan(struct i40e_vsi *vsi)
 }
 
 /**
+ * i40e_rm_default_mac_filter - Remove the default MAC filter set by NVM
+ * @vsi: the PF Main VSI - inappropriate for any other VSI
+ * @macaddr: the MAC address
+ *
+ * Remove whatever filter the firmware set up so the driver can manage
+ * its own filtering intelligently.
+ **/
+static void i40e_rm_default_mac_filter(struct i40e_vsi *vsi, u8 *macaddr)
+{
+   struct i40e_aqc_remove_macvlan_element_data element;
+   struct i40e_pf *pf = vsi->back;
+
+   /* Only appropriate for the PF main VSI */
+   if (vsi->type != I40E_VSI_MAIN)
+   return;
+
+   memset(&element, 0, sizeof(element));
+   ether_addr_copy(element.mac_addr, macaddr);
+   element.vlan_tag = 0;
+   /* Ignore error returns, some firmware does it this way... */
+   element.flags = I40E_AQC_MACVLAN_DEL_PERFECT_MATCH;
+   i40e_aq_remove_macvlan(&pf->hw, vsi->seid, &element, 1, NULL);
+
+   memset(&element, 0, sizeof(element));
+   ether_addr_copy(element.mac_addr, macaddr);
+   element.vlan_tag = 0;
+   /* ...and some firmware does it this way. */
+   element.flags = I40E_AQC_MACVLAN_DEL_PERFECT_MATCH |
+   I40E_AQC_MACVLAN_DEL_IGNORE_VLAN;
+   i40e_aq_remove_macvlan(&pf->hw, vsi->seid, &element, 1, NULL);
+}
+
+/**
  * i40e_add_filter - Add a mac/vlan filter to the VSI
  * @vsi: the VSI to be searched
  * @macaddr: the MAC address
@@ -9295,6 +9328,12 @@ static int i40e_config_netdev(struct i40e_vsi *vsi)
if (vsi->type == I40E_VSI_MAIN) {
SET_NETDEV_DEV(netdev, &pf->pdev->dev);
ether_addr_copy(mac_addr, hw->mac.perm_addr);
+   /* The following steps are necessary to prevent reception
+* of tagged packets - some older NVM configurations load a
+* default MAC-VLAN filter that accepts any tagged packet
+* which must be replaced by a normal filter.
+*/
+   i40e_rm_default_mac_filter(vsi, mac_addr);
spin_lock_bh(&vsi->mac_filter_hash_lock);
i40e_add_filter(vsi, mac_addr, I40E_VLAN_ANY);
spin_unlock_bh(&vsi->mac_filter_hash_lock);
@@ -9828,6 +9867,8 @@ static struct i40e_vsi *i40e_vsi_reinit_setup(struct 
i40e_vsi *vsi)
pf->vsi[pf->lan_vsi]->tc_config.enabled_tc = 0;
pf->vsi[pf->lan_vsi]->seid = pf->main_vsi_seid;
i40e_vsi_config_tc(pf->vsi[pf->lan_vsi], enabled_tc);
+   if (vsi->type == I40E_VSI_MAIN)
+   i40e_rm_default_mac_filter(vsi, pf->hw.mac.perm_addr);
 
/* assign it some queues */
ret = i40e_alloc_rings(vsi);
-- 
2.9.3



[net-next 11/20] i40e: Add functions which apply correct PHY access method for read and write operation

2016-12-06 Thread Jeff Kirsher
From: Michal Kosiarz 

Depending on external PHY type, register access method should be
different. Clause22 or Clause45 can be chosen for different PHYs.
Implemented functions apply correct access method for used device.

Change-ID: If39d5f0da9c0b905a8cbdc1ab89885535e7d0426
Signed-off-by: Michal Kosiarz 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_common.c  | 72 ++
 drivers/net/ethernet/intel/i40e/i40e_prototype.h   |  4 ++
 drivers/net/ethernet/intel/i40evf/i40e_prototype.h |  4 ++
 3 files changed, 80 insertions(+)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_common.c 
b/drivers/net/ethernet/intel/i40e/i40e_common.c
index f8c4c14..1287359 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_common.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_common.c
@@ -4676,6 +4676,78 @@ i40e_status i40e_write_phy_register_clause45(struct 
i40e_hw *hw,
 }
 
 /**
+ * i40e_write_phy_register
+ * @hw: pointer to the HW structure
+ * @page: registers page number
+ * @reg: register address in the page
+ * @phy_addr: PHY address on MDIO interface
+ * @value: PHY register value
+ *
+ * Writes value to specified PHY register
+ **/
+i40e_status i40e_write_phy_register(struct i40e_hw *hw,
+   u8 page, u16 reg, u8 phy_addr, u16 value)
+{
+   i40e_status status;
+
+   switch (hw->device_id) {
+   case I40E_DEV_ID_1G_BASE_T_X722:
+   status = i40e_write_phy_register_clause22(hw, reg, phy_addr,
+ value);
+   break;
+   case I40E_DEV_ID_10G_BASE_T:
+   case I40E_DEV_ID_10G_BASE_T4:
+   case I40E_DEV_ID_10G_BASE_T_X722:
+   case I40E_DEV_ID_25G_B:
+   case I40E_DEV_ID_25G_SFP28:
+   status = i40e_write_phy_register_clause45(hw, page, reg,
+ phy_addr, value);
+   break;
+   default:
+   status = I40E_ERR_UNKNOWN_PHY;
+   break;
+   }
+
+   return status;
+}
+
+/**
+ * i40e_read_phy_register
+ * @hw: pointer to the HW structure
+ * @page: registers page number
+ * @reg: register address in the page
+ * @phy_addr: PHY address on MDIO interface
+ * @value: PHY register value
+ *
+ * Reads specified PHY register value
+ **/
+i40e_status i40e_read_phy_register(struct i40e_hw *hw,
+  u8 page, u16 reg, u8 phy_addr, u16 *value)
+{
+   i40e_status status;
+
+   switch (hw->device_id) {
+   case I40E_DEV_ID_1G_BASE_T_X722:
+   status = i40e_read_phy_register_clause22(hw, reg, phy_addr,
+value);
+   break;
+   case I40E_DEV_ID_10G_BASE_T:
+   case I40E_DEV_ID_10G_BASE_T4:
+   case I40E_DEV_ID_10G_BASE_T_X722:
+   case I40E_DEV_ID_25G_B:
+   case I40E_DEV_ID_25G_SFP28:
+   status = i40e_read_phy_register_clause45(hw, page, reg,
+phy_addr, value);
+   break;
+   default:
+   status = I40E_ERR_UNKNOWN_PHY;
+   break;
+   }
+
+   return status;
+}
+
+/**
  * i40e_get_phy_address
  * @hw: pointer to the HW structure
  * @dev_num: PHY port num that address we want
diff --git a/drivers/net/ethernet/intel/i40e/i40e_prototype.h 
b/drivers/net/ethernet/intel/i40e/i40e_prototype.h
index 37d67e7..2551fc8 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_prototype.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_prototype.h
@@ -373,6 +373,10 @@ i40e_status i40e_read_phy_register_clause45(struct i40e_hw 
*hw,
u8 page, u16 reg, u8 phy_addr, u16 *value);
 i40e_status i40e_write_phy_register_clause45(struct i40e_hw *hw,
u8 page, u16 reg, u8 phy_addr, u16 value);
+i40e_status i40e_read_phy_register(struct i40e_hw *hw, u8 page, u16 reg,
+  u8 phy_addr, u16 *value);
+i40e_status i40e_write_phy_register(struct i40e_hw *hw, u8 page, u16 reg,
+   u8 phy_addr, u16 value);
 u8 i40e_get_phy_address(struct i40e_hw *hw, u8 dev_num);
 i40e_status i40e_blink_phy_link_led(struct i40e_hw *hw,
u32 time, u32 interval);
diff --git a/drivers/net/ethernet/intel/i40evf/i40e_prototype.h 
b/drivers/net/ethernet/intel/i40evf/i40e_prototype.h
index d89d521..ba6c6bd 100644
--- a/drivers/net/ethernet/intel/i40evf/i40e_prototype.h
+++ b/drivers/net/ethernet/intel/i40evf/i40e_prototype.h
@@ -115,6 +115,10 @@ i40e_status i40e_read_phy_register(struct i40e_hw *hw, u8 
page,
   u16 reg, u8 phy_addr, u16 *value);
 i40e_status i40e_write_phy_register(struct i40e_hw *hw, u8 page,

[net-next 09/20] i40e: Add support for 25G devices

2016-12-06 Thread Jeff Kirsher
From: Carolyn Wyborny 

Add support for 25G devices - defines and data structures.

One tricky part here is that the firmware support for these
Devices introduces a mismatch between the PHY type enum and
the bitfields for the phy types.

This change creates a macro and uses it to increment the 25G
PHY values when creating 25G bitfields.

Change-ID: I69b24d837d44cf9220bf5cb8dd46c5be89ce490b
Signed-off-by: Carolyn Wyborny 
Signed-off-by: Mitch Williams 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h  | 30 +++-
 drivers/net/ethernet/intel/i40e/i40e_common.c  | 11 ++-
 drivers/net/ethernet/intel/i40e/i40e_devids.h  |  2 +
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c | 26 ++-
 drivers/net/ethernet/intel/i40e/i40e_main.c|  6 +-
 drivers/net/ethernet/intel/i40e/i40e_type.h| 82 +-
 drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c |  3 +
 .../net/ethernet/intel/i40evf/i40e_adminq_cmd.h| 30 +++-
 drivers/net/ethernet/intel/i40evf/i40e_common.c|  2 +
 drivers/net/ethernet/intel/i40evf/i40e_devids.h|  2 +
 drivers/net/ethernet/intel/i40evf/i40e_type.h  | 82 +-
 drivers/net/ethernet/intel/i40evf/i40evf_ethtool.c |  8 +++
 .../net/ethernet/intel/i40evf/i40evf_virtchnl.c|  3 +
 13 files changed, 208 insertions(+), 79 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h 
b/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h
index 67e396b..c9d1f91 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h
@@ -1642,6 +1642,10 @@ enum i40e_aq_phy_type {
I40E_PHY_TYPE_1000BASE_LX   = 0x1C,
I40E_PHY_TYPE_1000BASE_T_OPTICAL= 0x1D,
I40E_PHY_TYPE_20GBASE_KR2   = 0x1E,
+   I40E_PHY_TYPE_25GBASE_KR= 0x1F,
+   I40E_PHY_TYPE_25GBASE_CR= 0x20,
+   I40E_PHY_TYPE_25GBASE_SR= 0x21,
+   I40E_PHY_TYPE_25GBASE_LR= 0x22,
I40E_PHY_TYPE_MAX
 };
 
@@ -1650,6 +1654,7 @@ enum i40e_aq_phy_type {
 #define I40E_LINK_SPEED_10GB_SHIFT 0x3
 #define I40E_LINK_SPEED_40GB_SHIFT 0x4
 #define I40E_LINK_SPEED_20GB_SHIFT 0x5
+#define I40E_LINK_SPEED_25GB_SHIFT 0x6
 
 enum i40e_aq_link_speed {
I40E_LINK_SPEED_UNKNOWN = 0,
@@ -1657,7 +1662,8 @@ enum i40e_aq_link_speed {
I40E_LINK_SPEED_1GB = BIT(I40E_LINK_SPEED_1000MB_SHIFT),
I40E_LINK_SPEED_10GB= BIT(I40E_LINK_SPEED_10GB_SHIFT),
I40E_LINK_SPEED_40GB= BIT(I40E_LINK_SPEED_40GB_SHIFT),
-   I40E_LINK_SPEED_20GB= BIT(I40E_LINK_SPEED_20GB_SHIFT)
+   I40E_LINK_SPEED_20GB= BIT(I40E_LINK_SPEED_20GB_SHIFT),
+   I40E_LINK_SPEED_25GB= BIT(I40E_LINK_SPEED_25GB_SHIFT),
 };
 
 struct i40e_aqc_module_desc {
@@ -1690,7 +1696,13 @@ struct i40e_aq_get_phy_abilities_resp {
__le32  eeer_val;
u8  d3_lpan;
 #define I40E_AQ_SET_PHY_D3_LPAN_ENA0x01
-   u8  reserved[3];
+   u8  phy_type_ext;
+#define I40E_AQ_PHY_TYPE_EXT_25G_KR0X01
+#define I40E_AQ_PHY_TYPE_EXT_25G_CR0X02
+#define I40E_AQ_PHY_TYPE_EXT_25G_SR0x04
+#define I40E_AQ_PHY_TYPE_EXT_25G_LR0x08
+   u8  mod_type_ext;
+   u8  ext_comp_code;
u8  phy_id[4];
u8  module_type[3];
u8  qualified_module_count;
@@ -1712,7 +1724,12 @@ struct i40e_aq_set_phy_config { /* same bits as above in 
all */
__le16  eee_capability;
__le32  eeer;
u8  low_power_ctrl;
-   u8  reserved[3];
+   u8  phy_type_ext;
+#define I40E_AQ_PHY_TYPE_EXT_25G_KR0X01
+#define I40E_AQ_PHY_TYPE_EXT_25G_CR0X02
+#define I40E_AQ_PHY_TYPE_EXT_25G_SR0x04
+#define I40E_AQ_PHY_TYPE_EXT_25G_LR0x08
+   u8  reserved[2];
 };
 
 I40E_CHECK_CMD_LENGTH(i40e_aq_set_phy_config);
@@ -1792,6 +1809,13 @@ struct i40e_aqc_get_link_status {
 #define I40E_AQ_LINK_TX_DRAINED0x01
 #define I40E_AQ_LINK_TX_FLUSHED0x03
 #define I40E_AQ_LINK_FORCED_40G0x10
+/* 25G Error Codes */
+#define I40E_AQ_25G_NO_ERR 0X00
+#define I40E_AQ_25G_NOT_PRESENT0X01
+#define I40E_AQ_25G_NVM_CRC_ERR0X02
+#define I40E_AQ_25G_SBUS_UCODE_ERR 0X03
+#define I40E_AQ_25G_SERDES_UCODE_ERR   0X04
+#define I40E_AQ_25G_NIMB_UCODE_ERR 0X05
u8  loopback; /* use defines from i40e_aqc_set_lb_mode */
__le16  max_frame_size;
u8  config;
diff --git a/drivers/net/ethernet/intel/i40e/i40e_common.c 
b/drivers/net/ethernet/intel/i40e/i40e_common.c
index eb392d6..1318c7d 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_common.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_common.c
@@ 

[net-next 04/20] i40e/i40evf: napi_poll must return the work done

2016-12-06 Thread Jeff Kirsher
From: Alexander Duyck 

Currently the function i40e_napi_poll() returns 0 when it completely
cleans the Rx rings, but this fouls budget accounting in core code.

Fix this by returning the actual work done, capped to budget - 1, since
the core doesn't allow returning the full budget when the driver modifies
the NAPI status.

This is based on a similar change that was made for the ixgbe driver by
Paolo Abeni.

Change-ID: Ic3d93ad2fa2fc8ce3164bc461e69367da0f9173b
Signed-off-by: Alexander Duyck 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   | 2 +-
 drivers/net/ethernet/intel/i40evf/i40e_txrx.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c 
b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 5544b50..352cf7c 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -2027,7 +2027,7 @@ int i40e_napi_poll(struct napi_struct *napi, int budget)
else
i40e_update_enable_itr(vsi, q_vector);
 
-   return 0;
+   return min(work_done, budget - 1);
 }
 
 /**
diff --git a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c 
b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
index c4b174a..df67ef3 100644
--- a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
@@ -1490,7 +1490,7 @@ int i40evf_napi_poll(struct napi_struct *napi, int budget)
else
i40e_update_enable_itr(vsi, q_vector);
 
-   return 0;
+   return min(work_done, budget - 1);
 }
 
 /**
-- 
2.9.3



[net-next 01/20] i40e: Driver prints log message on link speed change

2016-12-06 Thread Jeff Kirsher
From: Filip Sadowski 

This patch makes the driver log link speed changes. Before applying the
patch, link messages were printed only on state change. Now a message is
printed when the link is brought up or down and when the speed changes.

Change-ID: Ifbee14b4b16c24967450b3cecac6e8351dcc8f74
Signed-off-by: Filip Sadowski 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e.h  | 1 +
 drivers/net/ethernet/intel/i40e/i40e_main.c | 6 +-
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e.h 
b/drivers/net/ethernet/intel/i40e/i40e.h
index 4cb8fb3..06e3c23 100644
--- a/drivers/net/ethernet/intel/i40e/i40e.h
+++ b/drivers/net/ethernet/intel/i40e/i40e.h
@@ -596,6 +596,7 @@ struct i40e_vsi {
u16 veb_idx;/* index of VEB parent */
struct kobject *kobj;   /* sysfs object */
bool current_isup;  /* Sync 'link up' logging */
+   enum i40e_aq_link_speed current_speed;  /* Sync link speed logging */
 
void *priv; /* client driver data reference. */
 
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 5777e49..4534d41 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -5225,12 +5225,16 @@ static int i40e_init_pf_dcb(struct i40e_pf *pf)
  */
 void i40e_print_link_message(struct i40e_vsi *vsi, bool isup)
 {
+   enum i40e_aq_link_speed new_speed;
char *speed = "Unknown";
char *fc = "Unknown";
 
-   if (vsi->current_isup == isup)
+   new_speed = vsi->back->hw.phy.link_info.link_speed;
+
+   if ((vsi->current_isup == isup) && (vsi->current_speed == new_speed))
return;
vsi->current_isup = isup;
+   vsi->current_speed = new_speed;
if (!isup) {
netdev_info(vsi->netdev, "NIC Link is Down\n");
return;
-- 
2.9.3



[net-next 12/20] i40e: lock service task correctly

2016-12-06 Thread Jeff Kirsher
From: Mitch Williams 

The service task lock was being set in the scheduling function, not the
actual service task. This would potentially leave the bit set for a long
time before the task actually ran. Furthermore, if the service task
takes too long, it calls the schedule function to reschedule itself,
which would fail to take the lock and thus do nothing.

Instead, set and clear the lock bit in the service task itself. In the
process, get rid of the i40e_service_event_complete() function, which is
really just two lines of code that can be put right in the service task
itself.

Change-ID: I83155e682b686121e2897f4429eb7d3f7c669168
Signed-off-by: Mitch Williams 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 24 +++-
 1 file changed, 7 insertions(+), 17 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index b0486c9..c47e9c5 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -288,8 +288,7 @@ struct i40e_vsi *i40e_find_vsi_from_id(struct i40e_pf *pf, 
u16 id)
 void i40e_service_event_schedule(struct i40e_pf *pf)
 {
if (!test_bit(__I40E_DOWN, >state) &&
-   !test_bit(__I40E_RESET_RECOVERY_PENDING, >state) &&
-   !test_and_set_bit(__I40E_SERVICE_SCHED, >state))
+   !test_bit(__I40E_RESET_RECOVERY_PENDING, >state))
queue_work(i40e_wq, >service_task);
 }
 
@@ -5955,19 +5954,6 @@ static void i40e_handle_lan_overflow_event(struct 
i40e_pf *pf,
 }
 
 /**
- * i40e_service_event_complete - Finish up the service event
- * @pf: board private structure
- **/
-static void i40e_service_event_complete(struct i40e_pf *pf)
-{
-   WARN_ON(!test_bit(__I40E_SERVICE_SCHED, >state));
-
-   /* flush memory to make sure state is correct before next watchog */
-   smp_mb__before_atomic();
-   clear_bit(__I40E_SERVICE_SCHED, >state);
-}
-
-/**
  * i40e_get_cur_guaranteed_fd_count - Get the consumed guaranteed FD filters
  * @pf: board private structure
  **/
@@ -7276,10 +7262,12 @@ static void i40e_service_task(struct work_struct *work)
 
/* don't bother with service tasks if a reset is in progress */
if (test_bit(__I40E_RESET_RECOVERY_PENDING, >state)) {
-   i40e_service_event_complete(pf);
return;
}
 
+   if (test_and_set_bit(__I40E_SERVICE_SCHED, >state))
+   return;
+
i40e_detect_recover_hung(pf);
i40e_sync_filters_subtask(pf);
i40e_reset_subtask(pf);
@@ -7292,7 +7280,9 @@ static void i40e_service_task(struct work_struct *work)
i40e_sync_udp_filters_subtask(pf);
i40e_clean_adminq_subtask(pf);
 
-   i40e_service_event_complete(pf);
+   /* flush memory to make sure state is correct before next watchdog */
+   smp_mb__before_atomic();
+   clear_bit(__I40E_SERVICE_SCHED, >state);
 
/* If the tasks have taken longer than one timer cycle or there
 * is more work to be done, reschedule the service task now
-- 
2.9.3
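The pattern the patch moves to, taking the busy bit with a test-and-set at
the top of the worker and clearing it with a barrier at the end, can be
sketched with plain C11 atomics (an illustrative stand-in for the kernel's
test_and_set_bit()/clear_bit(), not driver code):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative stand-in for the __I40E_SERVICE_SCHED state bit. */
static atomic_flag service_busy = ATOMIC_FLAG_INIT;

/* Entry check at the top of the service task: returns false if another
 * instance already holds the bit, mirroring test_and_set_bit().
 */
static bool service_task_enter(void)
{
	return !atomic_flag_test_and_set(&service_busy);
}

/* Exit path: release the bit. atomic_flag_clear() is sequentially
 * consistent by default, playing the role of smp_mb__before_atomic()
 * followed by clear_bit() in the patch.
 */
static void service_task_exit(void)
{
	atomic_flag_clear(&service_busy);
}
```

A second invocation while the task is running simply bails out, and the
bit becomes available again as soon as the first instance exits.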



[net-next 05/20] i40e: remove code to handle dev_addr specially

2016-12-06 Thread Jeff Kirsher
From: Jacob Keller 

The netdev->dev_addr MAC filter already exists in the
MAC/VLAN hash table, as it is added when we configure
the netdev in i40e_configure_netdev. Because we already
know that this address will be updated in the
hash_for_each loops, we do not need to handle it
specially. This removes duplicate code and simplifies
the i40e_vsi_add_vlan and i40e_vsi_kill_vlan functions.
Because we know these filters must be part of the
MAC/VLAN hash table, this should not have any functional
impact on what filters are included and is merely a code
simplification.

Change-ID: I5e648302dbdd7cc29efc6d203b7019c11f0b5705
Signed-off-by: Jacob Keller 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 43 +
 1 file changed, 7 insertions(+), 36 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index c467cc4..ae4a2b2 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -2515,17 +2515,6 @@ int i40e_vsi_add_vlan(struct i40e_vsi *vsi, s16 vid)
/* Locked once because all functions invoked below iterates list*/
spin_lock_bh(>mac_filter_hash_lock);
 
-   if (vsi->netdev) {
-   add_f = i40e_add_filter(vsi, vsi->netdev->dev_addr, vid);
-   if (!add_f) {
-   dev_info(>back->pdev->dev,
-"Could not add vlan filter %d for %pM\n",
-vid, vsi->netdev->dev_addr);
-   spin_unlock_bh(>mac_filter_hash_lock);
-   return -ENOMEM;
-   }
-   }
-
hash_for_each_safe(vsi->mac_filter_hash, bkt, h, f, hlist) {
if (f->state == I40E_FILTER_REMOVE)
continue;
@@ -2539,28 +2528,14 @@ int i40e_vsi_add_vlan(struct i40e_vsi *vsi, s16 vid)
}
}
 
-   /* Now if we add a vlan tag, make sure to check if it is the first
-* tag (i.e. a "tag" -1 does exist) and if so replace the -1 "tag"
-* with 0, so we now accept untagged and specified tagged traffic
-* (and not all tags along with untagged)
+   /* When we add a new VLAN filter, we need to make sure that all existing
+* filters which are marked as vid=-1 (I40E_VLAN_ANY) are converted to
+* vid=0. The simplest way is just search for all filters marked as
+* vid=-1 and replace them with vid=0. This converts all filters that
+* were marked to receive all traffic (tagged or untagged) into
+* filters to receive only untagged traffic, so that we don't receive
+* tagged traffic for VLANs which we have not configured.
 */
-   if (vid > 0 && vsi->netdev) {
-   del_f = i40e_find_filter(vsi, vsi->netdev->dev_addr,
-I40E_VLAN_ANY);
-   if (del_f) {
-   __i40e_del_filter(vsi, del_f);
-   add_f = i40e_add_filter(vsi, vsi->netdev->dev_addr, 0);
-   if (!add_f) {
-   dev_info(>back->pdev->dev,
-"Could not add filter 0 for %pM\n",
-vsi->netdev->dev_addr);
-   spin_unlock_bh(>mac_filter_hash_lock);
-   return -ENOMEM;
-   }
-   }
-   }
-
-   /* Do not assume that I40E_VLAN_ANY should be reset to VLAN 0 */
if (vid > 0 && !vsi->info.pvid) {
hash_for_each_safe(vsi->mac_filter_hash, bkt, h, f, hlist) {
if (f->state == I40E_FILTER_REMOVE)
@@ -2597,7 +2572,6 @@ int i40e_vsi_add_vlan(struct i40e_vsi *vsi, s16 vid)
  **/
 void i40e_vsi_kill_vlan(struct i40e_vsi *vsi, s16 vid)
 {
-   struct net_device *netdev = vsi->netdev;
struct i40e_mac_filter *f;
struct hlist_node *h;
int bkt;
@@ -2605,9 +2579,6 @@ void i40e_vsi_kill_vlan(struct i40e_vsi *vsi, s16 vid)
/* Locked once because all functions invoked below iterates list */
spin_lock_bh(>mac_filter_hash_lock);
 
-   if (vsi->netdev)
-   i40e_del_filter(vsi, netdev->dev_addr, vid);
-
hash_for_each_safe(vsi->mac_filter_hash, bkt, h, f, hlist) {
if (f->vlan == vid)
__i40e_del_filter(vsi, f);
-- 
2.9.3



[net-next 16/20] i40e: delete filter after adding its replacement when converting

2016-12-06 Thread Jeff Kirsher
From: Jacob Keller 

Fix a subtle issue with the code for converting VID=-1 filters into VID=0
filters when adding a new VLAN. Previously the code deleted the VID=-1
filter, and then added a new VID=0 filter. In the rare case that the
addition fails due to -ENOMEM, we end up completely deleting the filter
which prevents recovery if memory pressure subsides. While this is not
strictly a problem, since memory pressure severe enough to trigger it
would likely cause many other failures, we shouldn't delete the filter
until after the addition succeeds.

Change-ID: Icba07ddd04ecc6a3b27c2e29f2c1c8673d266826
Signed-off-by: Jacob Keller 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 8e65972..f9e9c90 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -2535,7 +2535,6 @@ int i40e_vsi_add_vlan(struct i40e_vsi *vsi, s16 vid)
 I40E_VLAN_ANY);
if (!del_f)
continue;
-   __i40e_del_filter(vsi, del_f);
add_f = i40e_add_filter(vsi, f->macaddr, 0);
if (!add_f) {
dev_info(>back->pdev->dev,
@@ -2544,6 +2543,7 @@ int i40e_vsi_add_vlan(struct i40e_vsi *vsi, s16 vid)
spin_unlock_bh(>mac_filter_hash_lock);
return -ENOMEM;
}
+   __i40e_del_filter(vsi, del_f);
}
}
 
-- 
2.9.3
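The ordering fix above is a general make-before-break pattern: create the
replacement first, and only remove the original once the replacement
exists. A self-contained sketch with a toy fixed-size filter table
(hypothetical names, not the i40e data structures):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TBL_SIZE 4

struct toy_filter {
	int vid;
	bool used;
};

static struct toy_filter tbl[TBL_SIZE];

/* Allocate a filter slot; returns NULL when the table is full,
 * standing in for an -ENOMEM from i40e_add_filter().
 */
static struct toy_filter *toy_add(int vid)
{
	for (size_t i = 0; i < TBL_SIZE; i++) {
		if (!tbl[i].used) {
			tbl[i].used = true;
			tbl[i].vid = vid;
			return &tbl[i];
		}
	}
	return NULL;
}

static void toy_del(struct toy_filter *f)
{
	f->used = false;
}

/* Convert a filter to a new VID, make-before-break: on allocation
 * failure the original filter is left intact so that the conversion
 * can be retried once the pressure subsides.
 */
static int toy_convert(struct toy_filter *old, int new_vid)
{
	struct toy_filter *add_f = toy_add(new_vid);

	if (!add_f)
		return -1;	/* old filter survives */
	toy_del(old);
	return 0;
}
```

If the table is full, toy_convert() fails but the original VID=-1 filter
is still present, exactly the recovery property the patch restores.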



[net-next 06/20] i40e: Blink LED on 1G BaseT boards

2016-12-06 Thread Jeff Kirsher
From: Henry Tieman 

Before this patch, "ethtool -p" did not blink the LEDs on boards
with 1G BaseT PHYs.

This commit identifies 1G BaseT boards as having the LEDs connected
to the MAC. It also renames the flag to be more descriptive of its
usage: the flag is now I40E_FLAG_PHY_CONTROLS_LEDS.

Change-ID: I4eb741da9780da7849ddf2dc4c0cb27ffa42a801
Signed-off-by: Henry Tieman 
Signed-off-by: Harshitha Ramamurthy 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e.h |  2 +-
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c | 10 +-
 drivers/net/ethernet/intel/i40e/i40e_main.c|  4 ++--
 3 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e.h 
b/drivers/net/ethernet/intel/i40e/i40e.h
index 06e3c23..b8f2978 100644
--- a/drivers/net/ethernet/intel/i40e/i40e.h
+++ b/drivers/net/ethernet/intel/i40e/i40e.h
@@ -356,7 +356,7 @@ struct i40e_pf {
 #define I40E_FLAG_NO_DCB_SUPPORT   BIT_ULL(45)
 #define I40E_FLAG_USE_SET_LLDP_MIB BIT_ULL(46)
 #define I40E_FLAG_STOP_FW_LLDP BIT_ULL(47)
-#define I40E_FLAG_HAVE_10GBASET_PHYBIT_ULL(48)
+#define I40E_FLAG_PHY_CONTROLS_LEDSBIT_ULL(48)
 #define I40E_FLAG_PF_MAC   BIT_ULL(50)
 #define I40E_FLAG_TRUE_PROMISC_SUPPORT BIT_ULL(51)
 #define I40E_FLAG_HAVE_CRT_RETIMER BIT_ULL(52)
diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c 
b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
index 76753e1..6ba0035 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
@@ -1890,7 +1890,7 @@ static int i40e_set_phys_id(struct net_device *netdev,
 
switch (state) {
case ETHTOOL_ID_ACTIVE:
-   if (!(pf->flags & I40E_FLAG_HAVE_10GBASET_PHY)) {
+   if (!(pf->flags & I40E_FLAG_PHY_CONTROLS_LEDS)) {
pf->led_status = i40e_led_get(hw);
} else {
i40e_aq_set_phy_debug(hw, I40E_PHY_DEBUG_ALL, NULL);
@@ -1900,20 +1900,20 @@ static int i40e_set_phys_id(struct net_device *netdev,
}
return blink_freq;
case ETHTOOL_ID_ON:
-   if (!(pf->flags & I40E_FLAG_HAVE_10GBASET_PHY))
+   if (!(pf->flags & I40E_FLAG_PHY_CONTROLS_LEDS))
i40e_led_set(hw, 0xf, false);
else
ret = i40e_led_set_phy(hw, true, pf->led_status, 0);
break;
case ETHTOOL_ID_OFF:
-   if (!(pf->flags & I40E_FLAG_HAVE_10GBASET_PHY))
+   if (!(pf->flags & I40E_FLAG_PHY_CONTROLS_LEDS))
i40e_led_set(hw, 0x0, false);
else
ret = i40e_led_set_phy(hw, false, pf->led_status, 0);
break;
case ETHTOOL_ID_INACTIVE:
-   if (!(pf->flags & I40E_FLAG_HAVE_10GBASET_PHY)) {
-   i40e_led_set(hw, false, pf->led_status);
+   if (!(pf->flags & I40E_FLAG_PHY_CONTROLS_LEDS)) {
+   i40e_led_set(hw, pf->led_status, false);
} else {
ret = i40e_led_set_phy(hw, false, pf->led_status,
   (pf->phy_led_val |
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index ae4a2b2..dbb854b 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -11380,8 +11380,8 @@ static int i40e_probe(struct pci_dev *pdev, const 
struct pci_device_id *ent)
   pf->main_vsi_seid);
 
if ((pf->hw.device_id == I40E_DEV_ID_10G_BASE_T) ||
-   (pf->hw.device_id == I40E_DEV_ID_10G_BASE_T4))
-   pf->flags |= I40E_FLAG_HAVE_10GBASET_PHY;
+   (pf->hw.device_id == I40E_DEV_ID_10G_BASE_T4))
+   pf->flags |= I40E_FLAG_PHY_CONTROLS_LEDS;
if (pf->hw.device_id == I40E_DEV_ID_SFP_I_X722)
pf->flags |= I40E_FLAG_HAVE_CRT_RETIMER;
/* print a string summarizing features */
-- 
2.9.3



[net-next 18/20] i40e: use (add|rm)_vlan_all_mac helper functions when changing PVID

2016-12-06 Thread Jeff Kirsher
From: Jacob Keller 

The current flow for adding or updating the PVID for a VF uses
i40e_vsi_add_vlan and i40e_vsi_kill_vlan which each take, then release
the hash lock. In addition the two functions also must take special care
that they do not perform VLAN mode changes as this will make the code in
i40e_ndo_set_vf_port_vlan behave incorrectly.

Fix these issues by using the new helper functions i40e_add_vlan_all_mac
and i40e_rm_vlan_all_mac which expect the hash lock to already be taken.
Additionally these functions do not perform any state updates in regards
to VLAN mode, so they are safe to use in the PVID update flow.

It should be noted that we don't need the VLAN mode update code here,
because only a few flows are possible:

(a) we're adding a new PVID
  In this case, if we already had VLAN filters the VSI is knocked
  offline so we don't need to worry about pre-existing VLAN filters

(b) we're replacing an existing PVID
  In this case, we can't have any VLAN filters except those with the old
  PVID which we already take care of manually.

(c) we're removing an existing PVID
  Similarly to above, we can't have any existing VLAN filters except
  those with the old PVID which we already take care of correctly.

Because of this, we do not need (or even want) the special accounting
done in i40e_vsi_add_vlan, so use of the helpers is a saner alternative.
It also opens the door for a future patch which will refactor the flow
of i40e_vsi_add_vlan now that it is not needed in this function.

Change-ID: Ia841f63da94e12b106f41cf7d28ce8ce92f2ad99
Signed-off-by: Jacob Keller 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e.h |  2 +
 drivers/net/ethernet/intel/i40e/i40e_main.c|  4 +-
 drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c | 43 ++
 3 files changed, 32 insertions(+), 17 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e.h 
b/drivers/net/ethernet/intel/i40e/i40e.h
index f1d838f..ba8d309 100644
--- a/drivers/net/ethernet/intel/i40e/i40e.h
+++ b/drivers/net/ethernet/intel/i40e/i40e.h
@@ -852,7 +852,9 @@ int i40e_open(struct net_device *netdev);
 int i40e_close(struct net_device *netdev);
 int i40e_vsi_open(struct i40e_vsi *vsi);
 void i40e_vlan_stripping_disable(struct i40e_vsi *vsi);
+int i40e_add_vlan_all_mac(struct i40e_vsi *vsi, s16 vid);
 int i40e_vsi_add_vlan(struct i40e_vsi *vsi, s16 vid);
+void i40e_rm_vlan_all_mac(struct i40e_vsi *vsi, s16 vid);
 void i40e_vsi_kill_vlan(struct i40e_vsi *vsi, s16 vid);
 struct i40e_mac_filter *i40e_put_mac_in_vlan(struct i40e_vsi *vsi,
 const u8 *macaddr);
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 8aedfb7..49261cc 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -2505,7 +2505,7 @@ static void i40e_vlan_rx_register(struct net_device 
*netdev, u32 features)
  * NOTE: this function expects to be called while under the
  * mac_filter_hash_lock
  **/
-static int i40e_add_vlan_all_mac(struct i40e_vsi *vsi, s16 vid)
+int i40e_add_vlan_all_mac(struct i40e_vsi *vsi, s16 vid)
 {
struct i40e_mac_filter *f, *add_f;
struct hlist_node *h;
@@ -2596,7 +2596,7 @@ int i40e_vsi_add_vlan(struct i40e_vsi *vsi, s16 vid)
  * NOTE: this function expects to be called while under the
  * mac_filter_hash_lock
  */
-static void i40e_rm_vlan_all_mac(struct i40e_vsi *vsi, s16 vid)
+void i40e_rm_vlan_all_mac(struct i40e_vsi *vsi, s16 vid)
 {
struct i40e_mac_filter *f;
struct hlist_node *h;
diff --git a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c 
b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
index d28b684..a6198b7 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
@@ -2766,7 +2766,6 @@ int i40e_ndo_set_vf_port_vlan(struct net_device *netdev, 
int vf_id,
u16 vlanprio = vlan_id | (qos << I40E_VLAN_PRIORITY_SHIFT);
struct i40e_netdev_priv *np = netdev_priv(netdev);
struct i40e_pf *pf = np->vsi->back;
-   bool is_vsi_in_vlan = false;
struct i40e_vsi *vsi;
struct i40e_vf *vf;
int ret = 0;
@@ -2803,11 +2802,10 @@ int i40e_ndo_set_vf_port_vlan(struct net_device 
*netdev, int vf_id,
/* duplicate request, so just return success */
goto error_pvid;
 
+   /* Locked once because multiple functions below iterate list */
spin_lock_bh(>mac_filter_hash_lock);
-   is_vsi_in_vlan = i40e_is_vsi_in_vlan(vsi);
-   spin_unlock_bh(>mac_filter_hash_lock);
 
-   if (le16_to_cpu(vsi->info.pvid) == 0 && is_vsi_in_vlan) {
+   if (le16_to_cpu(vsi->info.pvid) == 0 && i40e_is_vsi_in_vlan(vsi)) {

[net-next 15/20] i40e: refactor i40e_update_filter_state to avoid passing aq_err

2016-12-06 Thread Jeff Kirsher
From: Jacob Keller 

The current caller of i40e_update_filter_state incorrectly passes
aq_ret, an i40e_status variable, instead of the expected aq_err. This
happens to work because i40e_status is just a typedef'd integer, and
0 is still the successful return. However, i40e_update_filter_state
has special handling for ENOSPC which is currently being ignored.

Also notice that firmware does not update the per-filter response for
many types of errors, such as EINVAL. Thus, modify the filter setup so
that the firmware response memory is pre-set with I40E_AQC_MM_ERR_NO_RES.

This enables us to refactor i40e_update_filter_state, removing the need
to pass aq_err and avoiding a need for having 3 different flows for
checking the filter state.

The resulting code for i40e_update_filter_state is much simpler, only
a single loop and we always check each filter response value every time.
Since we pre-set the response value to match our expected error this
correctly works for all success and error flows.

Change-ID: Ie292c9511f34ee18c6ef40f955ad13e28b7aea7d
Signed-off-by: Jacob Keller 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 58 +++--
 1 file changed, 21 insertions(+), 37 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 2ccf376..8e65972 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -1757,7 +1757,6 @@ static void i40e_undo_filter_entries(struct i40e_vsi *vsi,
  * @count: Number of filters added
  * @add_list: return data from fw
  * @head: pointer to first filter in current batch
- * @aq_err: status from fw
  *
  * MAC filter entries from list were slated to be added to device. Returns
  * number of successful filters. Note that 0 does NOT mean success!
@@ -1765,47 +1764,30 @@ static void i40e_undo_filter_entries(struct i40e_vsi 
*vsi,
 static int
 i40e_update_filter_state(int count,
 struct i40e_aqc_add_macvlan_element_data *add_list,
-struct i40e_mac_filter *add_head, int aq_err)
+struct i40e_mac_filter *add_head)
 {
int retval = 0;
int i;
 
-
-   if (!aq_err) {
-   retval = count;
-   /* Everything's good, mark all filters active. */
-   for (i = 0; i < count ; i++) {
-   add_head->state = I40E_FILTER_ACTIVE;
-   add_head = hlist_entry(add_head->hlist.next,
-  typeof(struct i40e_mac_filter),
-  hlist);
-   }
-   } else if (aq_err == I40E_AQ_RC_ENOSPC) {
-   /* Device ran out of filter space. Check the return value
-* for each filter to see which ones are active.
+   for (i = 0; i < count; i++) {
+   /* Always check status of each filter. We don't need to check
+* the firmware return status because we pre-set the filter
+* status to I40E_AQC_MM_ERR_NO_RES when sending the filter
+* request to the adminq. Thus, if it no longer matches then
+* we know the filter is active.
 */
-   for (i = 0; i < count ; i++) {
-   if (add_list[i].match_method ==
-   I40E_AQC_MM_ERR_NO_RES) {
-   add_head->state = I40E_FILTER_FAILED;
-   } else {
-   add_head->state = I40E_FILTER_ACTIVE;
-   retval++;
-   }
-   add_head = hlist_entry(add_head->hlist.next,
-  typeof(struct i40e_mac_filter),
-  hlist);
-   }
-   } else {
-   /* Some other horrible thing happened, fail all filters */
-   retval = 0;
-   for (i = 0; i < count ; i++) {
+   if (add_list[i].match_method == I40E_AQC_MM_ERR_NO_RES) {
add_head->state = I40E_FILTER_FAILED;
-   add_head = hlist_entry(add_head->hlist.next,
-  typeof(struct i40e_mac_filter),
-  hlist);
+   } else {
+   add_head->state = I40E_FILTER_ACTIVE;
+   retval++;
}
+
+   add_head = hlist_entry(add_head->hlist.next,
+  typeof(struct i40e_mac_filter),
+  hlist);
}
+
return retval;
 }
 
@@ -1864,12 +1846,11 @@ void i40e_aqc_add_filters(struct i40e_vsi 

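The pre-set-response trick described above generalizes: initialize the
reply buffer to the error sentinel before submitting, and any entry the
firmware never wrote still reads as failed, so a single checking loop
covers full success, partial success, and total failure. A toy model
(hypothetical names; the real sentinel is I40E_AQC_MM_ERR_NO_RES):

```c
#include <assert.h>
#include <string.h>

#define ERR_NO_RES 0xff	/* stands in for I40E_AQC_MM_ERR_NO_RES */
#define MM_OK      0x01

struct toy_resp {
	unsigned char match_method;
};

/* Toy "firmware": on partial success it only fills in the first
 * n_handled responses and leaves the rest untouched.
 */
static void toy_fw_add(struct toy_resp *resp, int count, int n_handled)
{
	for (int i = 0; i < n_handled && i < count; i++)
		resp[i].match_method = MM_OK;
}

/* Single loop, as in the refactored i40e_update_filter_state(): any
 * entry still equal to the sentinel is failed, everything else is
 * counted as active.
 */
static int toy_count_active(const struct toy_resp *resp, int count)
{
	int active = 0;

	for (int i = 0; i < count; i++)
		if (resp[i].match_method != ERR_NO_RES)
			active++;
	return active;
}

static int toy_add_filters(int count, int n_handled)
{
	struct toy_resp resp[8];

	memset(resp, ERR_NO_RES, sizeof(resp));	/* pre-set before submit */
	toy_fw_add(resp, count, n_handled);
	return toy_count_active(resp, count);
}
```

Because the buffer starts out as "all failed", the counting loop needs no
knowledge of which error path the firmware took.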
[net-next 02/20] i40e: simplify txd use count calculation

2016-12-06 Thread Jeff Kirsher
From: Mitch Williams 

The i40e_txd_use_count function was fast but confusing. In the comments,
it even admits that it's ugly. So replace it with a new function that is
(very) slightly faster and has extensive commenting to help the thicker
among us (including the author, who will forget in a week) understand
how it works.

Change-ID: Ifb533f13786a0bf39cb29f77969a5be2c83d9a87
Signed-off-by: Mitch Williams 
Signed-off-by: Alexander Duyck 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.h   | 45 +--
 drivers/net/ethernet/intel/i40evf/i40e_txrx.h | 45 +--
 2 files changed, 56 insertions(+), 34 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h 
b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
index de8550f..e065321 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
@@ -173,26 +173,37 @@ static inline bool i40e_test_staterr(union i40e_rx_desc 
*rx_desc,
 #define I40E_MAX_DATA_PER_TXD_ALIGNED \
(I40E_MAX_DATA_PER_TXD & ~(I40E_MAX_READ_REQ_SIZE - 1))
 
-/* This ugly bit of math is equivalent to DIV_ROUNDUP(size, X) where X is
- * the value I40E_MAX_DATA_PER_TXD_ALIGNED.  It is needed due to the fact
- * that 12K is not a power of 2 and division is expensive.  It is used to
- * approximate the number of descriptors used per linear buffer.  Note
- * that this will overestimate in some cases as it doesn't account for the
- * fact that we will add up to 4K - 1 in aligning the 12K buffer, however
- * the error should not impact things much as large buffers usually mean
- * we will use fewer descriptors then there are frags in an skb.
+/**
+ * i40e_txd_use_count  - estimate the number of descriptors needed for Tx
+ * @size: transmit request size in bytes
+ *
+ * Due to hardware alignment restrictions (4K alignment), we need to
+ * assume that we can have no more than 12K of data per descriptor, even
+ * though each descriptor can take up to 16K - 1 bytes of aligned memory.
+ * Thus, we need to divide by 12K. But division is slow! Instead,
+ * we decompose the operation into shifts and one relatively cheap
+ * multiply operation.
+ *
+ * To divide by 12K, we first divide by 4K, then divide by 3:
+ * To divide by 4K, shift right by 12 bits
+ * To divide by 3, multiply by 85, then divide by 256
+ * (Divide by 256 is done by shifting right by 8 bits)
+ * Finally, we add one to round up. Because 256 isn't an exact multiple of
+ * 3, we'll underestimate near each multiple of 12K. This is actually more
+ * accurate as we have 4K - 1 of wiggle room that we can fit into the last
+ * segment.  For our purposes this is accurate out to 1M which is orders of
+ * magnitude greater than our largest possible GSO size.
+ *
+ * This would then be implemented as:
+ * return (((size >> 12) * 85) >> 8) + 1;
+ *
+ * Since multiplication and division are commutative, we can reorder
+ * operations into:
+ * return ((size * 85) >> 20) + 1;
  */
 static inline unsigned int i40e_txd_use_count(unsigned int size)
 {
-   const unsigned int max = I40E_MAX_DATA_PER_TXD_ALIGNED;
-   const unsigned int reciprocal = ((1ull << 32) - 1 + (max / 2)) / max;
-   unsigned int adjust = ~(u32)0;
-
-   /* if we rounded up on the reciprocal pull down the adjustment */
-   if ((max * reciprocal) > adjust)
-   adjust = ~(u32)(reciprocal - 1);
-
-   return (u32)u64)size * reciprocal) + adjust) >> 32);
+   return ((size * 85) >> 20) + 1;
 }
 
 /* Tx Descriptors needed, worst case */
diff --git a/drivers/net/ethernet/intel/i40evf/i40e_txrx.h 
b/drivers/net/ethernet/intel/i40evf/i40e_txrx.h
index a586e19..a5fc789 100644
--- a/drivers/net/ethernet/intel/i40evf/i40e_txrx.h
+++ b/drivers/net/ethernet/intel/i40evf/i40e_txrx.h
@@ -173,26 +173,37 @@ static inline bool i40e_test_staterr(union i40e_rx_desc 
*rx_desc,
 #define I40E_MAX_DATA_PER_TXD_ALIGNED \
(I40E_MAX_DATA_PER_TXD & ~(I40E_MAX_READ_REQ_SIZE - 1))
 
-/* This ugly bit of math is equivalent to DIV_ROUNDUP(size, X) where X is
- * the value I40E_MAX_DATA_PER_TXD_ALIGNED.  It is needed due to the fact
- * that 12K is not a power of 2 and division is expensive.  It is used to
- * approximate the number of descriptors used per linear buffer.  Note
- * that this will overestimate in some cases as it doesn't account for the
- * fact that we will add up to 4K - 1 in aligning the 12K buffer, however
- * the error should not impact things much as large buffers usually mean
- * we will use fewer descriptors then there are frags in an skb.
+/**
+ * i40e_txd_use_count  - estimate the number of descriptors needed for Tx
+ * @size: transmit request size in bytes
+ *
+ * Due to hardware alignment restrictions (4K alignment), we need to
+ * 

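The shift-and-multiply arithmetic the new comment walks through can be
checked directly against the exact DIV_ROUND_UP it approximates (plain C
outside the driver; the range-check helper is for illustration only):

```c
#include <assert.h>

#define MAX_DATA_ALIGNED 12288U	/* 12K, as in I40E_MAX_DATA_PER_TXD_ALIGNED */

/* The replacement from the patch: ((size >> 12) * 85) >> 8, reordered
 * into a single multiply and shift.
 */
static unsigned int txd_use_count(unsigned int size)
{
	return ((size * 85) >> 20) + 1;
}

/* Exact DIV_ROUND_UP(size, 12K) for comparison. */
static unsigned int txd_exact(unsigned int size)
{
	return (size + MAX_DATA_ALIGNED - 1) / MAX_DATA_ALIGNED;
}

/* Verify the comment's accuracy claim over a range: the estimate never
 * exceeds the exact count, and undershoots by at most one only near
 * multiples of 12K, which the 4K - 1 of alignment wiggle room absorbs.
 */
static int txd_estimate_ok(unsigned int lo, unsigned int hi)
{
	for (unsigned int s = lo; s <= hi; s++) {
		unsigned int est = txd_use_count(s);
		unsigned int exact = txd_exact(s);

		if (est > exact || est + 1 < exact)
			return 0;
	}
	return 1;
}
```

Exhaustively checking every size up to 1M confirms the "accurate out to
1M" claim in the patch's comment.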
[net-next 13/20] i40e: defeature support for PTP L4 frame detection on XL710

2016-12-06 Thread Jeff Kirsher
From: Jacob Keller 

A product decision has been made to defeature detection of PTP frames
over L4 (UDP) on the XL710 MAC. Do not advertise support for L4
timestamping.

Change-ID: I41fbb0f84ebb27c43e23098c08156f2625c6ee06
Signed-off-by: Jacob Keller 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e.h |  1 +
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c | 15 +--
 drivers/net/ethernet/intel/i40e/i40e_main.c|  3 ++-
 drivers/net/ethernet/intel/i40e/i40e_ptp.c | 21 +++--
 4 files changed, 31 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e.h 
b/drivers/net/ethernet/intel/i40e/i40e.h
index b8f2978..f1d838f 100644
--- a/drivers/net/ethernet/intel/i40e/i40e.h
+++ b/drivers/net/ethernet/intel/i40e/i40e.h
@@ -360,6 +360,7 @@ struct i40e_pf {
 #define I40E_FLAG_PF_MAC   BIT_ULL(50)
 #define I40E_FLAG_TRUE_PROMISC_SUPPORT BIT_ULL(51)
 #define I40E_FLAG_HAVE_CRT_RETIMER BIT_ULL(52)
+#define I40E_FLAG_PTP_L4_CAPABLE   BIT_ULL(53)
 
/* tracks features that get auto disabled by errors */
u64 auto_disable_flags;
diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c 
b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
index 935160a..cc1465a 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
@@ -1681,8 +1681,19 @@ static int i40e_get_ts_info(struct net_device *dev,
info->tx_types = BIT(HWTSTAMP_TX_OFF) | BIT(HWTSTAMP_TX_ON);
 
info->rx_filters = BIT(HWTSTAMP_FILTER_NONE) |
-  BIT(HWTSTAMP_FILTER_PTP_V1_L4_EVENT) |
-  BIT(HWTSTAMP_FILTER_PTP_V2_EVENT);
+  BIT(HWTSTAMP_FILTER_PTP_V2_L2_EVENT) |
+  BIT(HWTSTAMP_FILTER_PTP_V2_L2_SYNC) |
+  BIT(HWTSTAMP_FILTER_PTP_V2_L2_DELAY_REQ);
+
+   if (pf->flags & I40E_FLAG_PTP_L4_CAPABLE)
+   info->rx_filters |= BIT(HWTSTAMP_FILTER_PTP_V1_L4_SYNC) |
+   BIT(HWTSTAMP_FILTER_PTP_V1_L4_DELAY_REQ) |
+   BIT(HWTSTAMP_FILTER_PTP_V2_EVENT) |
+   BIT(HWTSTAMP_FILTER_PTP_V2_L4_EVENT) |
+   BIT(HWTSTAMP_FILTER_PTP_V2_SYNC) |
+   BIT(HWTSTAMP_FILTER_PTP_V2_L4_SYNC) |
+   BIT(HWTSTAMP_FILTER_PTP_V2_DELAY_REQ) |
+   BIT(HWTSTAMP_FILTER_PTP_V2_L4_DELAY_REQ);
 
return 0;
 }
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index c47e9c5..806fd56 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -8699,7 +8699,8 @@ static int i40e_sw_init(struct i40e_pf *pf)
 I40E_FLAG_MULTIPLE_TCP_UDP_RSS_PCTYPE |
 I40E_FLAG_NO_PCI_LINK_CHECK |
 I40E_FLAG_USE_SET_LLDP_MIB |
-I40E_FLAG_GENEVE_OFFLOAD_CAPABLE;
+I40E_FLAG_GENEVE_OFFLOAD_CAPABLE |
+I40E_FLAG_PTP_L4_CAPABLE;
} else if ((pf->hw.aq.api_maj_ver > 1) ||
   ((pf->hw.aq.api_maj_ver == 1) &&
(pf->hw.aq.api_min_ver > 4))) {
diff --git a/drivers/net/ethernet/intel/i40e/i40e_ptp.c 
b/drivers/net/ethernet/intel/i40e/i40e_ptp.c
index 5e2272c..9e49ffa 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_ptp.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_ptp.c
@@ -521,6 +521,8 @@ static int i40e_ptp_set_timestamp_mode(struct i40e_pf *pf,
case HWTSTAMP_FILTER_PTP_V1_L4_SYNC:
case HWTSTAMP_FILTER_PTP_V1_L4_DELAY_REQ:
case HWTSTAMP_FILTER_PTP_V1_L4_EVENT:
+   if (!(pf->flags & I40E_FLAG_PTP_L4_CAPABLE))
+   return -ERANGE;
pf->ptp_rx = true;
tsyntype = I40E_PRTTSYN_CTL1_V1MESSTYPE0_MASK |
   I40E_PRTTSYN_CTL1_TSYNTYPE_V1 |
@@ -528,19 +530,26 @@ static int i40e_ptp_set_timestamp_mode(struct i40e_pf *pf,
config->rx_filter = HWTSTAMP_FILTER_PTP_V1_L4_EVENT;
break;
case HWTSTAMP_FILTER_PTP_V2_EVENT:
-   case HWTSTAMP_FILTER_PTP_V2_L2_EVENT:
case HWTSTAMP_FILTER_PTP_V2_L4_EVENT:
case HWTSTAMP_FILTER_PTP_V2_SYNC:
-   case HWTSTAMP_FILTER_PTP_V2_L2_SYNC:
case HWTSTAMP_FILTER_PTP_V2_L4_SYNC:
case HWTSTAMP_FILTER_PTP_V2_DELAY_REQ:
-   case HWTSTAMP_FILTER_PTP_V2_L2_DELAY_REQ:
case HWTSTAMP_FILTER_PTP_V2_L4_DELAY_REQ:
+   if (!(pf->flags & I40E_FLAG_PTP_L4_CAPABLE))
+   return -ERANGE;
+   

[net-next 20/20] i40e: don't allow i40e_vsi_(add|kill)_vlan to operate when VID<1

2016-12-06 Thread Jeff Kirsher
From: Jacob Keller 

Now that we have the separate i40e_(add|rm)_vlan_all_mac functions, we
should not be using the i40e_vsi_kill_vlan or i40e_vsi_add_vlan
functions when PVID is set or when VID is less than 1. This allows us to
remove some checks in i40e_vsi_add_vlan and ensures that callers which
need to handle VID=0 or VID=-1 don't accidentally invoke the VLAN mode
handling used to convert filters when entering VLAN mode. We also update
the functions to take u16 instead of s16, since they no longer expect to
be called with VID=I40E_VLAN_ANY.

Change-ID: Ibddf44a8bb840dde8ceef2a4fdb92fd953b05a57
Signed-off-by: Jacob Keller 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e.h  |  4 ++--
 drivers/net/ethernet/intel/i40e/i40e_main.c | 14 ++
 2 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e.h 
b/drivers/net/ethernet/intel/i40e/i40e.h
index ba8d309..7f208f4 100644
--- a/drivers/net/ethernet/intel/i40e/i40e.h
+++ b/drivers/net/ethernet/intel/i40e/i40e.h
@@ -853,9 +853,9 @@ int i40e_close(struct net_device *netdev);
 int i40e_vsi_open(struct i40e_vsi *vsi);
 void i40e_vlan_stripping_disable(struct i40e_vsi *vsi);
 int i40e_add_vlan_all_mac(struct i40e_vsi *vsi, s16 vid);
-int i40e_vsi_add_vlan(struct i40e_vsi *vsi, s16 vid);
+int i40e_vsi_add_vlan(struct i40e_vsi *vsi, u16 vid);
 void i40e_rm_vlan_all_mac(struct i40e_vsi *vsi, s16 vid);
-void i40e_vsi_kill_vlan(struct i40e_vsi *vsi, s16 vid);
+void i40e_vsi_kill_vlan(struct i40e_vsi *vsi, u16 vid);
 struct i40e_mac_filter *i40e_put_mac_in_vlan(struct i40e_vsi *vsi,
 const u8 *macaddr);
 int i40e_del_mac_all_vlan(struct i40e_vsi *vsi, const u8 *macaddr);
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index da4cbe3..148a678 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -2575,12 +2575,15 @@ int i40e_add_vlan_all_mac(struct i40e_vsi *vsi, s16 vid)
 /**
  * i40e_vsi_add_vlan - Add VSI membership for given VLAN
  * @vsi: the VSI being configured
- * @vid: VLAN id to be added (0 = untagged only , -1 = any)
+ * @vid: VLAN id to be added
  **/
-int i40e_vsi_add_vlan(struct i40e_vsi *vsi, s16 vid)
+int i40e_vsi_add_vlan(struct i40e_vsi *vsi, u16 vid)
 {
int err;
 
+   if (!(vid > 0) || vsi->info.pvid)
+   return -EINVAL;
+
/* Locked once because all functions invoked below iterates list*/
	spin_lock_bh(&vsi->mac_filter_hash_lock);
err = i40e_add_vlan_all_mac(vsi, vid);
@@ -2623,10 +2626,13 @@ void i40e_rm_vlan_all_mac(struct i40e_vsi *vsi, s16 vid)
 /**
  * i40e_vsi_kill_vlan - Remove VSI membership for given VLAN
  * @vsi: the VSI being configured
- * @vid: VLAN id to be removed (0 = untagged only , -1 = any)
+ * @vid: VLAN id to be removed
  **/
-void i40e_vsi_kill_vlan(struct i40e_vsi *vsi, s16 vid)
+void i40e_vsi_kill_vlan(struct i40e_vsi *vsi, u16 vid)
 {
+   if (!(vid > 0) || vsi->info.pvid)
+   return;
+
	spin_lock_bh(&vsi->mac_filter_hash_lock);
	i40e_rm_vlan_all_mac(vsi, vid);
	spin_unlock_bh(&vsi->mac_filter_hash_lock);
-- 
2.9.3



[net-next 19/20] i40e: move all updates for VLAN mode into i40e_sync_vsi_filters

2016-12-06 Thread Jeff Kirsher
From: Jacob Keller 

In a similar fashion to how we handled exiting VLAN mode, move the logic
in i40e_vsi_add_vlan into i40e_sync_vsi_filters. Extract this logic into
its own function for ease of understanding as it will become quite
complex.

The new function, i40e_correct_mac_vlan_filters() correctly updates all
filters for when we need to enter VLAN mode, exit VLAN mode, and also
enforces the PVID when assigned.

Call i40e_correct_mac_vlan_filters from i40e_sync_vsi_filters passing it
the number of active VLAN filters, and the two temporary lists.

Remove the function for updating VLAN=0 filters from i40e_vsi_add_vlan.

The end result is that the logic for entering and exiting VLAN mode is
in one location which has the most knowledge about all filters. This
ensures that we always correctly have the non-VLAN filters assigned to
VID=0 or VID=-1 regardless of how we ended up getting to this result.

Additionally this enforces the PVID at sync time, so that we know for
certain that an assigned PVID results in only filters with that PVID
being added to the firmware.
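
The three correction rules can be condensed into a pure function. This is a
hand-rolled userspace model of the logic, not the driver code itself;
VLAN_ANY mirrors I40E_VLAN_ANY (-1).

```c
#include <assert.h>

#define VLAN_ANY  (-1)	/* models I40E_VLAN_ANY */

/* Per-filter VLAN correction as done by i40e_correct_mac_vlan_filters():
 *  a) with a PVID assigned, every filter must carry VLAN=PVID;
 *  b) otherwise, with active VLAN filters, VLAN=-1 filters become VLAN=0
 *     so non-VLAN MACs match untagged traffic only;
 *  c) with no active VLAN filters, VLAN=0 filters become VLAN=-1 so they
 *     match tagged and untagged traffic alike.
 */
static int corrected_vlan(int vlan, int pvid, int vlan_filters)
{
	if (pvid && vlan != pvid)
		return pvid;
	if (!pvid && vlan_filters && vlan == VLAN_ANY)
		return 0;
	if (!pvid && !vlan_filters && vlan == 0)
		return VLAN_ANY;
	return vlan;	/* filter already correct */
}
```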

Change-ID: I895cee81e9c92d0a16baee38bd0ca51bbb14e372
Signed-off-by: Jacob Keller 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 214 +++-
 1 file changed, 113 insertions(+), 101 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 49261cc..da4cbe3 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -1227,6 +1227,107 @@ bool i40e_is_vsi_in_vlan(struct i40e_vsi *vsi)
 }
 
 /**
+ * i40e_correct_mac_vlan_filters - Correct non-VLAN filters if necessary
+ * @vsi: the VSI to configure
+ * @tmp_add_list: list of filters ready to be added
+ * @tmp_del_list: list of filters ready to be deleted
+ * @vlan_filters: the number of active VLAN filters
+ *
+ * Update VLAN=0 and VLAN=-1 (I40E_VLAN_ANY) filters properly so that they
+ * behave as expected. If we have any active VLAN filters remaining or about
+ * to be added then we need to update non-VLAN filters to be marked as VLAN=0
+ * so that they only match against untagged traffic. If we no longer have any
+ * active VLAN filters, we need to make all non-VLAN filters marked as VLAN=-1
+ * so that they match against both tagged and untagged traffic. In this way,
+ * we ensure that we correctly receive the desired traffic. This ensures that
+ * when we have an active VLAN we will receive only untagged traffic and
+ * traffic matching active VLANs. If we have no active VLANs then we will
+ * operate in non-VLAN mode and receive all traffic, tagged or untagged.
+ *
+ * Finally, in a similar fashion, this function also corrects filters when
+ * there is an active PVID assigned to this VSI.
+ *
+ * In case of memory allocation failure return -ENOMEM. Otherwise, return 0.
+ *
+ * This function is only expected to be called from within
+ * i40e_sync_vsi_filters.
+ *
+ * NOTE: This function expects to be called while under the
+ * mac_filter_hash_lock
+ */
+static int i40e_correct_mac_vlan_filters(struct i40e_vsi *vsi,
+struct hlist_head *tmp_add_list,
+struct hlist_head *tmp_del_list,
+int vlan_filters)
+{
+   struct i40e_mac_filter *f, *add_head;
+   struct hlist_node *h;
+   int bkt, new_vlan;
+
+   /* To determine if a particular filter needs to be replaced we
+* have the three following conditions:
+*
+* a) if we have a PVID assigned, then all filters which are
+*not marked as VLAN=PVID must be replaced with filters that
+*are.
+* b) otherwise, if we have any active VLANS, all filters
+*which are marked as VLAN=-1 must be replaced with
+*filters marked as VLAN=0
+* c) finally, if we do not have any active VLANS, all filters
+*which are marked as VLAN=0 must be replaced with filters
+*marked as VLAN=-1
+*/
+
+   /* Update the filters about to be added in place */
+   hlist_for_each_entry(f, tmp_add_list, hlist) {
+   if (vsi->info.pvid && f->vlan != vsi->info.pvid)
+   f->vlan = vsi->info.pvid;
+   else if (vlan_filters && f->vlan == I40E_VLAN_ANY)
+   f->vlan = 0;
+   else if (!vlan_filters && f->vlan == 0)
+   f->vlan = I40E_VLAN_ANY;
+   }
+
+   /* Update the remaining active filters */
+   hash_for_each_safe(vsi->mac_filter_hash, bkt, h, f, hlist) {
+   /* Combine the checks for whether a filter needs to be changed
+* and then determine the new VLAN inside the if block, in
+* order to 

[net-next 14/20] i40e: recalculate vsi->active_filters from hash contents

2016-12-06 Thread Jeff Kirsher
From: Jacob Keller 

Previous code refactors have accidentally caused issues with the
counting of active_filters. Avoid similar issues in the future by simply
re-counting the active filters every time after we handle add and delete
of all the filters. Additionally this allows us to simplify the check
for when we exit promiscuous mode since we can combine the check for
failed filters at the same time.

Additionally since we recount filters at the end we need to set
vsi->promisc_threshold as well.

The resulting code takes a bit longer since we do have to loop over
filters again. However, the result is more readable and less likely to
become incorrect due to failed accounting of filters in the future.
Finally, this ensures that it is not possible for vsi->active_filters to
ever underflow since we never decrement it.
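
The recount-from-scratch approach can be modeled in plain C. The sketch below
is a userspace stand-in (a flat array replaces the kernel's mac_filter_hash,
and all names are hypothetical):

```c
#include <assert.h>
#include <stddef.h>

enum filter_state { FILTER_ACTIVE, FILTER_FAILED, FILTER_NEW, FILTER_REMOVE };

/* Recount active/failed filters from scratch, as the patch now does after
 * every sync, instead of incrementally adjusting vsi->active_filters.
 * Returns the active count; *failed receives the failed count.
 */
static unsigned int recount_filters(const enum filter_state *f, size_t n,
				    unsigned int *failed)
{
	unsigned int active = 0;
	size_t i;

	*failed = 0;
	for (i = 0; i < n; i++) {
		if (f[i] == FILTER_ACTIVE)
			active++;
		else if (f[i] == FILTER_FAILED)
			(*failed)++;
	}
	return active;
}

/* The 3/4 promiscuous-exit threshold is derived from the fresh count. */
static unsigned int promisc_threshold(unsigned int active)
{
	return (active * 3) / 4;
}
```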

Change-ID: Ib4f3a377e60eb1fa6c91ea86cc02238c08edd102
Signed-off-by: Jacob Keller 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 52 -
 1 file changed, 29 insertions(+), 23 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 806fd56..2ccf376 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -1870,12 +1870,10 @@ void i40e_aqc_add_filters(struct i40e_vsi *vsi, const 
char *vsi_name,
aq_ret = i40e_aq_add_macvlan(hw, vsi->seid, list, num_add, NULL);
aq_err = hw->aq.asq_last_status;
fcnt = i40e_update_filter_state(num_add, list, add_head, aq_ret);
-   vsi->active_filters += fcnt;
 
if (fcnt != num_add) {
*promisc_changed = true;
		set_bit(__I40E_FILTER_OVERFLOW_PROMISC, &vsi->state);
-   vsi->promisc_threshold = (vsi->active_filters * 3) / 4;
		dev_warn(&vsi->back->pdev->dev,
			 "Error %s adding RX filters on %s, promiscuous mode forced on\n",
			 i40e_aq_str(hw, aq_err),
@@ -1939,6 +1937,7 @@ int i40e_sync_vsi_filters(struct i40e_vsi *vsi)
	struct i40e_hw *hw = &vsi->back->hw;
unsigned int vlan_any_filters = 0;
unsigned int non_vlan_filters = 0;
+   unsigned int failed_filters = 0;
unsigned int vlan_filters = 0;
bool promisc_changed = false;
char vsi_name[16] = "PF";
@@ -1985,7 +1984,6 @@ int i40e_sync_vsi_filters(struct i40e_vsi *vsi)
/* Move the element into temporary del_list */
			hash_del(&f->hlist);
			hlist_add_head(&f->hlist, &tmp_del_list);
-   vsi->active_filters--;
 
/* Avoid counting removed filters */
continue;
@@ -2046,7 +2044,6 @@ int i40e_sync_vsi_filters(struct i40e_vsi *vsi)
f->state = I40E_FILTER_REMOVE;
			hash_del(&f->hlist);
			hlist_add_head(&f->hlist, &tmp_del_list);
-   vsi->active_filters--;
}
 
/* Also update any filters on the tmp_add list */
@@ -2203,27 +2200,36 @@ int i40e_sync_vsi_filters(struct i40e_vsi *vsi)
add_list = NULL;
}
 
-   /* Check to see if we can drop out of overflow promiscuous mode. */
+   /* Determine the number of active and failed filters. */
+   spin_lock_bh(&vsi->mac_filter_hash_lock);
+   vsi->active_filters = 0;
+   hash_for_each(vsi->mac_filter_hash, bkt, f, hlist) {
+   if (f->state == I40E_FILTER_ACTIVE)
+   vsi->active_filters++;
+   else if (f->state == I40E_FILTER_FAILED)
+   failed_filters++;
+   }
+   spin_unlock_bh(&vsi->mac_filter_hash_lock);
+
+   /* If promiscuous mode has changed, we need to calculate a new
+* threshold for when we are safe to exit
+*/
+   if (promisc_changed)
+   vsi->promisc_threshold = (vsi->active_filters * 3) / 4;
+
+   /* Check if we are able to exit overflow promiscuous mode. We can
+* safely exit if we didn't just enter, we no longer have any failed
+* filters, and we have reduced filters below the threshold value.
+*/
	if (test_bit(__I40E_FILTER_OVERFLOW_PROMISC, &vsi->state) &&
+   !promisc_changed && !failed_filters &&
(vsi->active_filters < vsi->promisc_threshold)) {
-   int failed_count = 0;
-   /* See if we have any failed filters. We can't drop out of
-* promiscuous until these have all been deleted.
-*/
-   spin_lock_bh(&vsi->mac_filter_hash_lock);
-   hash_for_each(vsi->mac_filter_hash, bkt, f, hlist) {
-   

[net-next 10/20] i40e: Add FEC for 25g

2016-12-06 Thread Jeff Kirsher
From: Carolyn Wyborny 

This patch adds adminq support for Forward Error
Correction ("FEC") for 25G products.

Change-ID: Iaff4910737c239d2c730e5c22a313ce9c37d3964
Signed-off-by: Carolyn Wyborny 
Signed-off-by: Mitch Williams 
Signed-off-by: Jacek Naczyk 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h  | 25 --
 drivers/net/ethernet/intel/i40e/i40e_common.c  |  2 ++
 .../net/ethernet/intel/i40evf/i40e_adminq_cmd.h| 25 --
 3 files changed, 48 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h 
b/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h
index c9d1f91..b2101a5 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h
@@ -1686,6 +1686,8 @@ struct i40e_aq_get_phy_abilities_resp {
 #define I40E_AQ_PHY_LINK_ENABLED   0x08
 #define I40E_AQ_PHY_AN_ENABLED 0x10
 #define I40E_AQ_PHY_FLAG_MODULE_QUAL   0x20
+#define I40E_AQ_PHY_FEC_ABILITY_KR 0x40
+#define I40E_AQ_PHY_FEC_ABILITY_RS 0x80
__le16  eee_capability;
 #define I40E_AQ_EEE_100BASE_TX 0x0002
 #define I40E_AQ_EEE_1000BASE_T 0x0004
@@ -1701,7 +1703,16 @@ struct i40e_aq_get_phy_abilities_resp {
#define I40E_AQ_PHY_TYPE_EXT_25G_CR	0X02
#define I40E_AQ_PHY_TYPE_EXT_25G_SR	0x04
#define I40E_AQ_PHY_TYPE_EXT_25G_LR	0x08
-   u8  mod_type_ext;
+   u8  fec_cfg_curr_mod_ext_info;
+#define I40E_AQ_ENABLE_FEC_KR  0x01
+#define I40E_AQ_ENABLE_FEC_RS  0x02
+#define I40E_AQ_REQUEST_FEC_KR 0x04
+#define I40E_AQ_REQUEST_FEC_RS 0x08
+#define I40E_AQ_ENABLE_FEC_AUTO0x10
+#define I40E_AQ_FEC
+#define I40E_AQ_MODULE_TYPE_EXT_MASK   0xE0
+#define I40E_AQ_MODULE_TYPE_EXT_SHIFT  5
+
u8  ext_comp_code;
u8  phy_id[4];
u8  module_type[3];
@@ -1729,7 +1740,15 @@ struct i40e_aq_set_phy_config { /* same bits as above in 
all */
#define I40E_AQ_PHY_TYPE_EXT_25G_CR	0X02
#define I40E_AQ_PHY_TYPE_EXT_25G_SR	0x04
#define I40E_AQ_PHY_TYPE_EXT_25G_LR	0x08
-   u8  reserved[2];
+   u8  fec_config;
+#define I40E_AQ_SET_FEC_ABILITY_KR BIT(0)
+#define I40E_AQ_SET_FEC_ABILITY_RS BIT(1)
+#define I40E_AQ_SET_FEC_REQUEST_KR BIT(2)
+#define I40E_AQ_SET_FEC_REQUEST_RS BIT(3)
+#define I40E_AQ_SET_FEC_AUTO   BIT(4)
+#define I40E_AQ_PHY_FEC_CONFIG_SHIFT   0x0
+#define I40E_AQ_PHY_FEC_CONFIG_MASK(0x1F << I40E_AQ_PHY_FEC_CONFIG_SHIFT)
+   u8  reserved;
 };
 
 I40E_CHECK_CMD_LENGTH(i40e_aq_set_phy_config);
@@ -1819,6 +1838,8 @@ struct i40e_aqc_get_link_status {
u8  loopback; /* use defines from i40e_aqc_set_lb_mode */
__le16  max_frame_size;
u8  config;
+#define I40E_AQ_CONFIG_FEC_KR_ENA  0x01
+#define I40E_AQ_CONFIG_FEC_RS_ENA  0x02
 #define I40E_AQ_CONFIG_CRC_ENA 0x04
 #define I40E_AQ_CONFIG_PACING_MASK 0x78
u8  external_power_ability;
diff --git a/drivers/net/ethernet/intel/i40e/i40e_common.c 
b/drivers/net/ethernet/intel/i40e/i40e_common.c
index 1318c7d..f8c4c14 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_common.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_common.c
@@ -1714,6 +1714,8 @@ enum i40e_status_code i40e_set_fc(struct i40e_hw *hw, u8 
*aq_failures,
config.eee_capability = abilities.eee_capability;
config.eeer = abilities.eeer_val;
config.low_power_ctrl = abilities.d3_lpan;
+   config.fec_config = abilities.fec_cfg_curr_mod_ext_info &
+   I40E_AQ_PHY_FEC_CONFIG_MASK;
	status = i40e_aq_set_phy_config(hw, &config, NULL);
 
if (status)
diff --git a/drivers/net/ethernet/intel/i40evf/i40e_adminq_cmd.h 
b/drivers/net/ethernet/intel/i40evf/i40e_adminq_cmd.h
index f8d7d95..eeb9864 100644
--- a/drivers/net/ethernet/intel/i40evf/i40e_adminq_cmd.h
+++ b/drivers/net/ethernet/intel/i40evf/i40e_adminq_cmd.h
@@ -1683,6 +1683,8 @@ struct i40e_aq_get_phy_abilities_resp {
 #define I40E_AQ_PHY_LINK_ENABLED   0x08
 #define I40E_AQ_PHY_AN_ENABLED 0x10
 #define I40E_AQ_PHY_FLAG_MODULE_QUAL   0x20
+#define I40E_AQ_PHY_FEC_ABILITY_KR 0x40
+#define I40E_AQ_PHY_FEC_ABILITY_RS 0x80
__le16  eee_capability;
 #define I40E_AQ_EEE_100BASE_TX 0x0002
 #define I40E_AQ_EEE_1000BASE_T 0x0004
@@ -1698,7 +1700,16 @@ struct i40e_aq_get_phy_abilities_resp {
#define I40E_AQ_PHY_TYPE_EXT_25G_CR	0X02
#define I40E_AQ_PHY_TYPE_EXT_25G_SR	0x04
#define I40E_AQ_PHY_TYPE_EXT_25G_LR	0x08
-   u8  mod_type_ext;
+   u8  fec_cfg_curr_mod_ext_info;
+#define I40E_AQ_ENABLE_FEC_KR  0x01
+#define 

Re: [PATCH 10/10] virtio: enable endian checks for sparse builds

2016-12-06 Thread Christoph Hellwig
On Tue, Dec 06, 2016 at 05:41:05PM +0200, Michael S. Tsirkin wrote:
> __CHECK_ENDIAN__ isn't on by default presumably because
> it triggers too many sparse warnings for correct code.
> But virtio is now clean of these warnings, and
> we want to keep it this way - enable this for
> sparse builds.
> 
> Signed-off-by: Michael S. Tsirkin 

Nah.  Please just enable it globally when using sparse.  I actually
had a chat with Linus about that a while ago and he seemed generally
fine with it, I just didn't manage to actually do it..


Re: [PATCH net-next v2 0/7] bnxt_en: Add interface to support RDMA driver.

2016-12-06 Thread Christoph Hellwig
On Wed, Dec 07, 2016 at 12:26:14AM -0500, Michael Chan wrote:
> This series adds an interface to support a brand new RDMA driver bnxt_re.
> The first step is to re-arrange some code so that pci_enable_msix() can
> be called during pci probe.  The purpose is to allow the RDMA driver to
> initialize and stay initialized whether the netdev is up or down.

Please switch from pci_enable_msix to pci_alloc_irq_vectors for any
changes to MSI-X code, thanks!


Re: linux-next: manual merge of the staging tree with the net-next tree

2016-12-06 Thread Greg KH
On Wed, Dec 07, 2016 at 03:04:47PM +1100, Stephen Rothwell wrote:
> Hi Greg,
> 
> Today's linux-next merge of the staging tree got a conflict in:
> 
>   drivers/staging/slicoss/slicoss.c
> 
> between commit:
> 
>   a52ad514fdf3 ("net: deprecate eth_change_mtu, remove usage")
> 
> from the net-next tree and commit:
> 
>   0af72df267f2 ("staging: slicoss: remove the staging driver")
> 
> from the staging tree.
> 
> I fixed it up (I just removed the file) and can carry the fix as
> necessary. This is now fixed as far as linux-next is concerned, but any
> non trivial conflicts should be mentioned to your upstream maintainer
> when your tree is submitted for merging.  You may also want to consider
> cooperating with the maintainer of the conflicting tree to minimise any
> particularly complex conflicts.

Thanks, we did coordinate this :)

greg k-h


[net-next] icmp: correct return value of icmp_rcv()

2016-12-06 Thread Zhang Shengju
Currently, icmp_rcv() always returns zero on a packet delivery upcall.

To make its behavior more compliant with the way this API should be
used, this patch changes this to let it return NET_RX_SUCCESS when the
packet is properly handled, and NET_RX_DROP otherwise.
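
A userspace model of the changed tail of icmp_rcv(). The NET_RX_* values
mirror include/linux/netdevice.h; note that NET_RX_SUCCESS is 0, so callers
that only test for non-zero still see the old behaviour on the success path.

```c
#include <assert.h>
#include <stdbool.h>

#define NET_RX_SUCCESS 0	/* packet delivered */
#define NET_RX_DROP    1	/* packet dropped */

/* The delivery outcome is now reflected in the return value instead of
 * unconditionally returning 0.
 */
static int icmp_rcv_result(bool success)
{
	return success ? NET_RX_SUCCESS : NET_RX_DROP;
}
```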

Signed-off-by: Zhang Shengju 
---
 net/ipv4/icmp.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index 691146a..f79d7a8 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -1047,12 +1047,12 @@ int icmp_rcv(struct sk_buff *skb)
 
if (success)  {
consume_skb(skb);
-   return 0;
+   return NET_RX_SUCCESS;
}
 
 drop:
kfree_skb(skb);
-   return 0;
+   return NET_RX_DROP;
 csum_error:
__ICMP_INC_STATS(net, ICMP_MIB_CSUMERRORS);
 error:
-- 
1.8.3.1





Re: [PATCH] net: return value of skb_linearize should be handled in Linux kernel

2016-12-06 Thread Eric Dumazet
On Tue, 2016-12-06 at 15:10 +0800, Zhouyi Zhou wrote:
> kmalloc_reserve may fail to allocate memory inside skb_linearize, 
> which means skb_linearize's return value should not be ignored. 
> The following patch corrects the uses of skb_linearize.
> 
> Compiled in x86_64
> 
> Signed-off-by: Zhouyi Zhou 
> ---
>  drivers/infiniband/hw/nes/nes_nic.c   | 5 +++--
>  drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.c | 6 +-
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 3 +--
>  drivers/scsi/bnx2fc/bnx2fc_fcoe.c | 7 +--
>  drivers/scsi/fcoe/fcoe.c  | 5 -
>  net/tipc/link.c   | 3 ++-
>  net/tipc/name_distr.c | 5 -
>  7 files changed, 24 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/infiniband/hw/nes/nes_nic.c 
> b/drivers/infiniband/hw/nes/nes_nic.c
> index 2b27d13..69372ea 100644
> --- a/drivers/infiniband/hw/nes/nes_nic.c
> +++ b/drivers/infiniband/hw/nes/nes_nic.c
> @@ -662,10 +662,11 @@ static int nes_netdev_start_xmit(struct sk_buff *skb, 
> struct net_device *netdev)
>   nesnic->sq_head &= nesnic->sq_size-1;
>   }
>   } else {
> - nesvnic->linearized_skbs++;
>   hoffset = skb_transport_header(skb) - skb->data;
>   nhoffset = skb_network_header(skb) - skb->data;
> - skb_linearize(skb);
> + if (skb_linearize(skb))
> + return NETDEV_TX_BUSY;

This would live lock.

Please drop the packet.

You probably should send one patch per driver, to ease code review and
acceptance.
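
The handling being asked for can be sketched as a userspace model. The
NETDEV_TX_* stubs mirror include/linux/netdevice.h; DROPPED/QUEUED are
invented markers for illustration, not kernel symbols.

```c
#include <assert.h>

#define NETDEV_TX_OK   0x00
#define NETDEV_TX_BUSY 0x10

enum { DROPPED = -1, QUEUED = 1 };

/* On skb_linearize() failure the driver must free the skb and report
 * NETDEV_TX_OK: returning NETDEV_TX_BUSY makes the stack requeue the
 * same skb and retry forever (a live lock) if the failure persists.
 */
static int xmit_model(int linearize_err, int *fate)
{
	if (linearize_err) {
		*fate = DROPPED;  /* dev_kfree_skb_any(skb) in a real driver */
		return NETDEV_TX_OK;
	}
	*fate = QUEUED;
	return NETDEV_TX_OK;
}
```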





Re: [PATCH net-next] net: sock_rps_record_flow() is for connected sockets

2016-12-06 Thread Eric Dumazet
On Tue, 2016-12-06 at 19:32 -0800, Eric Dumazet wrote:
> A follow up patch will provide a static_key (Jump Label) since most
> hosts do not even use RFS.

Speaking of static_key, it appears we now have GRO on UDP, and this
consumes a considerable amount of cpu cycles.

Turning off GRO allows me to get +20 % more packets on my single UDP
socket. (1.2 Mpps instead of 1.0 Mpps)

Surely udp_gro_receive() should be bypassed if no UDP socket has
registered a udp_sk(sk)->gro_receive handler 

And/or delay the inet_add_offload({4|6}_offload, IPPROTO_UDP); to
the first UDP sockets setting udp_sk(sk)->gro_receive handler,
ie udp_encap_enable() and udpv6_encap_enable()
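
The suggested bypass can be sketched as a userspace model. The plain counter
below stands in for what would be a static_key (jump label) in the kernel:
udp_encap_enable() would increment it, and udp_gro_receive() would bail out
up front while it is zero.

```c
#include <assert.h>

/* Count of sockets that registered a gro_receive handler. */
static int udp_encap_needed;

static void udp_encap_enable(void)
{
	udp_encap_needed++;
}

/* Returns 0 (bail out, no socket can consume GRO'd UDP) or
 * 1 (proceed with the expensive GRO socket lookup).
 */
static int udp_gro_receive_model(void)
{
	if (!udp_encap_needed)
		return 0;
	return 1;
}
```

In the kernel the disabled case would compile down to a single patched
branch, so hosts that never create encapsulated UDP sockets pay nothing.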


:(





Re: [PATCH] net: return value of skb_linearize should be handled in Linux kernel

2016-12-06 Thread Zhouyi Zhou
On Wed, Dec 7, 2016 at 1:02 PM, Cong Wang  wrote:
> On Mon, Dec 5, 2016 at 11:10 PM, Zhouyi Zhou  wrote:
>> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.c 
>> b/drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.c
>> index 2a653ec..ab787cb 100644
>> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.c
>> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.c
>> @@ -490,7 +490,11 @@ int ixgbe_fcoe_ddp(struct ixgbe_adapter *adapter,
>>  */
>> if ((fh->fh_r_ctl == FC_RCTL_DD_SOL_DATA) &&
>> (fctl & FC_FC_END_SEQ)) {
>> -   skb_linearize(skb);
>> +   int err = 0;
>> +
>> +   err = skb_linearize(skb);
>> +   if (err)
>> +   return err;
>
>
> You can reuse 'rc' instead of adding 'err'.
rc here is meaningful: it holds the length of the data being DDPed. If
we reused rc here, a successful skb_linearize would set it to 0.
>
>
>
>> crc = (struct fcoe_crc_eof *)skb_put(skb, sizeof(*crc));
>> crc->fcoe_eof = FC_EOF_T;
>> }
>> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
>> b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
>> index fee1f29..4926d48 100644
>> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
>> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
>> @@ -2173,8 +2173,7 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector 
>> *q_vector,
>> total_rx_bytes += ddp_bytes;
>> total_rx_packets += DIV_ROUND_UP(ddp_bytes,
>>  mss);
>> -   }
>> -   if (!ddp_bytes) {
>> +   } else {
>> dev_kfree_skb_any(skb);
>> continue;
>> }
>
>
> This piece doesn't seem to be related.
If ddp_bytes is negative there was an error, so I think the skb
should not be passed to the upper layer.


Re: [PATCH 10/10] virtio: enable endian checks for sparse builds

2016-12-06 Thread Johannes Berg
On Tue, 2016-12-06 at 17:41 +0200, Michael S. Tsirkin wrote:

> It seems that there should be a better way to do it,
> but this works too.

In some cases there might be:

> --- a/drivers/s390/virtio/Makefile
> +++ b/drivers/s390/virtio/Makefile
> @@ -6,6 +6,8 @@
>  # it under the terms of the GNU General Public License (version 2
> only)
>  # as published by the Free Software Foundation.
>  
> +CFLAGS_virtio_ccw.o += -D__CHECK_ENDIAN__
> +CFLAGS_kvm_virtio.o += -D__CHECK_ENDIAN__
>  s390-virtio-objs := virtio_ccw.o
>  ifdef CONFIG_S390_GUEST_OLD_TRANSPORT
>  s390-virtio-objs += kvm_virtio.o

Here you could use

ccflags-y += -D__CHECK_ENDIAN__

for example, or even

subdir-ccflags-y += -D__CHECK_ENDIAN__

(in case any subdirs ever get added here)

> --- a/drivers/vhost/Makefile
> +++ b/drivers/vhost/Makefile
> @@ -1,3 +1,4 @@
> +ccflags-y := -D__CHECK_ENDIAN__

Looks like you did that here and in some other places though - so
perhaps the s390 one was intentionally different?

> --- a/net/packet/Makefile
> +++ b/net/packet/Makefile
> @@ -2,6 +2,7 @@
>  # Makefile for the packet AF.
>  #
>  
> +ccflags-y := -D__CHECK_ENDIAN__

Technically this is slightly more than advertised, but I guess that
still makes sense if it's clean now.

johannes



Re: Oops with CONFIG_VMAP_STCK and bond device + virtio-net

2016-12-06 Thread Cong Wang
On Mon, Dec 5, 2016 at 3:53 PM, Laura Abbott  wrote:
> This looks like an issue with CONFIG_VMAP_STACK since bond_enslave uses
> struct sockaddr from the stack and virtnet_set_mac_address calls
> sg_init_one which triggers BUG_ON(!virt_addr_valid(buf));
>
> I know there have been a lot of CONFIG_VMAP_STACK fixes around but I
> didn't find this one reported yet.

Fixed by:

commit e37e2ff350a321ad9c36b588e76f34fbba305be6
Author: Andy Lutomirski 
Date:   Mon Dec 5 18:10:58 2016 -0800

virtio-net: Fix DMA-from-the-stack in virtnet_set_mac_address()


[PATCH v3 net-next 1/4] bpf: xdp: Allow head adjustment in XDP prog

2016-12-06 Thread Martin KaFai Lau
This patch allows an XDP prog to extend/remove the packet
data at the head (like adding or removing a header).  It is
done by adding a new XDP helper bpf_xdp_adjust_head().

It also renames bpf_helper_changes_skb_data() to
bpf_helper_changes_pkt_data() to better reflect
that XDP prog does not work on skb.

To avoid breaking unsupported drivers, this patch
also does the needed checking before setting
the xdp_prog fd to the device.  It is done by
1) Adding a XDP_QUERY_FEATURES command
2) Adding one "xdp_adjust_head" bit to bpf_prog
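
The helper's semantics can be modeled as plain pointer arithmetic. This is a
userspace sketch with an invented struct; the real helper is implemented in
net/core/filter.c and is additionally subject to verifier checks.

```c
#include <assert.h>

struct xdp_md_model {
	unsigned char *data_hard_start;	/* reserved headroom begins here */
	unsigned char *data;		/* current packet start */
	unsigned char *data_end;
};

/* Model of bpf_xdp_adjust_head(): move the packet start by 'offset'
 * bytes (negative grows the packet into the headroom, positive shrinks
 * it), refusing any move that would leave the buffer bounds.
 */
static int xdp_adjust_head_model(struct xdp_md_model *xdp, int offset)
{
	unsigned char *data = xdp->data + offset;

	if (data < xdp->data_hard_start || data >= xdp->data_end)
		return -1;
	xdp->data = data;
	return 0;
}
```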

Acked-by: Alexei Starovoitov 
Signed-off-by: Martin KaFai Lau 
---
 arch/powerpc/net/bpf_jit_comp64.c  |  4 ++--
 arch/s390/net/bpf_jit_comp.c   |  2 +-
 arch/x86/net/bpf_jit_comp.c|  2 +-
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c |  3 +++
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |  3 +++
 .../net/ethernet/netronome/nfp/nfp_net_common.c|  3 +++
 drivers/net/ethernet/qlogic/qede/qede_main.c   |  3 +++
 include/linux/filter.h |  6 +++--
 include/linux/netdevice.h  | 12 ++
 include/uapi/linux/bpf.h   | 11 -
 kernel/bpf/core.c  |  2 +-
 kernel/bpf/syscall.c   |  2 ++
 kernel/bpf/verifier.c  |  2 +-
 net/core/dev.c |  9 +++
 net/core/filter.c  | 28 --
 15 files changed, 81 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/net/bpf_jit_comp64.c 
b/arch/powerpc/net/bpf_jit_comp64.c
index 0fe98a567125..73a5cf18fd84 100644
--- a/arch/powerpc/net/bpf_jit_comp64.c
+++ b/arch/powerpc/net/bpf_jit_comp64.c
@@ -766,7 +766,7 @@ static int bpf_jit_build_body(struct bpf_prog *fp, u32 
*image,
func = (u8 *) __bpf_call_base + imm;
 
/* Save skb pointer if we need to re-cache skb data */
-   if (bpf_helper_changes_skb_data(func))
+   if (bpf_helper_changes_pkt_data(func))
PPC_BPF_STL(3, 1, bpf_jit_stack_local(ctx));
 
bpf_jit_emit_func_call(image, ctx, (u64)func);
@@ -775,7 +775,7 @@ static int bpf_jit_build_body(struct bpf_prog *fp, u32 
*image,
PPC_MR(b2p[BPF_REG_0], 3);
 
/* refresh skb cache */
-   if (bpf_helper_changes_skb_data(func)) {
+   if (bpf_helper_changes_pkt_data(func)) {
/* reload skb pointer to r3 */
PPC_BPF_LL(3, 1, bpf_jit_stack_local(ctx));
bpf_jit_emit_skb_loads(image, ctx);
diff --git a/arch/s390/net/bpf_jit_comp.c b/arch/s390/net/bpf_jit_comp.c
index bee281f3163d..167b31b186c1 100644
--- a/arch/s390/net/bpf_jit_comp.c
+++ b/arch/s390/net/bpf_jit_comp.c
@@ -981,7 +981,7 @@ static noinline int bpf_jit_insn(struct bpf_jit *jit, 
struct bpf_prog *fp, int i
EMIT2(0x0d00, REG_14, REG_W1);
/* lgr %b0,%r2: load return value into %b0 */
EMIT4(0xb904, BPF_REG_0, REG_2);
-   if (bpf_helper_changes_skb_data((void *)func)) {
+   if (bpf_helper_changes_pkt_data((void *)func)) {
jit->seen |= SEEN_SKB_CHANGE;
/* lg %b1,ST_OFF_SKBP(%r15) */
EMIT6_DISP_LH(0xe300, 0x0004, BPF_REG_1, REG_0,
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index fe04a04dab8e..e76d1af60f7a 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -853,7 +853,7 @@ xadd:   if (is_imm8(insn->off))
func = (u8 *) __bpf_call_base + imm32;
jmp_offset = func - (image + addrs[i]);
if (seen_ld_abs) {
-			reload_skb_data = bpf_helper_changes_skb_data(func);
+			reload_skb_data = bpf_helper_changes_pkt_data(func);
if (reload_skb_data) {
EMIT1(0x57); /* push %rdi */
				jmp_offset += 22; /* pop, mov, sub, mov */
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c 
b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index 49a81f1fc1d6..6261157f444e 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -2794,6 +2794,9 @@ static int mlx4_xdp(struct net_device *dev, struct 
netdev_xdp *xdp)
case XDP_QUERY_PROG:
xdp->prog_attached = mlx4_xdp_attached(dev);
return 0;
+   case XDP_QUERY_FEATURES:
+   xdp->features = 0;
+   return 0;
default:
 

[PATCH v3 net-next 4/4] bpf: xdp: Add XDP example for head adjustment

2016-12-06 Thread Martin KaFai Lau
The XDP prog checks if the incoming packet matches any VIP:PORT
combination in the BPF hashmap.  If it does, it will encapsulate
the packet with an IPv4/v6 header as instructed by the value of
the BPF hashmap and then XDP_TX it out.

The VIP:PORT -> IP-Encap-Info can be specified by the cmd args
of the user prog.

Acked-by: Alexei Starovoitov 
Signed-off-by: Martin KaFai Lau 
---
 samples/bpf/Makefile  |   4 +
 samples/bpf/bpf_helpers.h |   2 +
 samples/bpf/bpf_load.c|  94 ++
 samples/bpf/bpf_load.h|   1 +
 samples/bpf/xdp1_user.c   |  93 --
 samples/bpf/xdp_tx_iptnl_common.h |  37 ++
 samples/bpf/xdp_tx_iptnl_kern.c   | 232 ++
 samples/bpf/xdp_tx_iptnl_user.c   | 253 ++
 8 files changed, 623 insertions(+), 93 deletions(-)
 create mode 100644 samples/bpf/xdp_tx_iptnl_common.h
 create mode 100644 samples/bpf/xdp_tx_iptnl_kern.c
 create mode 100644 samples/bpf/xdp_tx_iptnl_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 00cd3081c038..f78e0ef6ff10 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -33,6 +33,7 @@ hostprogs-y += trace_event
 hostprogs-y += sampleip
 hostprogs-y += tc_l2_redirect
 hostprogs-y += lwt_len_hist
+hostprogs-y += xdp_tx_iptnl
 
 test_lru_dist-objs := test_lru_dist.o libbpf.o
 sock_example-objs := sock_example.o libbpf.o
@@ -67,6 +68,7 @@ trace_event-objs := bpf_load.o libbpf.o trace_event_user.o
 sampleip-objs := bpf_load.o libbpf.o sampleip_user.o
 tc_l2_redirect-objs := bpf_load.o libbpf.o tc_l2_redirect_user.o
 lwt_len_hist-objs := bpf_load.o libbpf.o lwt_len_hist_user.o
+xdp_tx_iptnl-objs := bpf_load.o libbpf.o xdp_tx_iptnl_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -99,6 +101,7 @@ always += test_current_task_under_cgroup_kern.o
 always += trace_event_kern.o
 always += sampleip_kern.o
 always += lwt_len_hist_kern.o
+always += xdp_tx_iptnl_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 HOSTCFLAGS += -I$(srctree)/tools/testing/selftests/bpf/
@@ -129,6 +132,7 @@ HOSTLOADLIBES_trace_event += -lelf
 HOSTLOADLIBES_sampleip += -lelf
 HOSTLOADLIBES_tc_l2_redirect += -l elf
 HOSTLOADLIBES_lwt_len_hist += -l elf
+HOSTLOADLIBES_xdp_tx_iptnl += -lelf
 
 # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on 
cmdline:
 #  make samples/bpf/ LLC=~/git/llvm/build/bin/llc 
CLANG=~/git/llvm/build/bin/clang
diff --git a/samples/bpf/bpf_helpers.h b/samples/bpf/bpf_helpers.h
index 8370a6e3839d..faaffe2e139a 100644
--- a/samples/bpf/bpf_helpers.h
+++ b/samples/bpf/bpf_helpers.h
@@ -57,6 +57,8 @@ static int (*bpf_skb_set_tunnel_opt)(void *ctx, void *md, int 
size) =
(void *) BPF_FUNC_skb_set_tunnel_opt;
 static unsigned long long (*bpf_get_prandom_u32)(void) =
(void *) BPF_FUNC_get_prandom_u32;
+static int (*bpf_xdp_adjust_head)(void *ctx, int offset) =
+   (void *) BPF_FUNC_xdp_adjust_head;
 
 /* llvm builtin functions that eBPF C program may use to
  * emit BPF_LD_ABS and BPF_LD_IND instructions
diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index 49b45ccbe153..e30b6de94f2e 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -12,6 +12,10 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
+#include 
 #include 
 #include 
 #include 
@@ -450,3 +454,93 @@ struct ksym *ksym_search(long key)
/* out of range. return _stext */
return [0];
 }
+
+int set_link_xdp_fd(int ifindex, int fd)
+{
+   struct sockaddr_nl sa;
+   int sock, seq = 0, len, ret = -1;
+   char buf[4096];
+   struct nlattr *nla, *nla_xdp;
+   struct {
+   struct nlmsghdr  nh;
+   struct ifinfomsg ifinfo;
+   char attrbuf[64];
+   } req;
+   struct nlmsghdr *nh;
+   struct nlmsgerr *err;
+
+   memset(&sa, 0, sizeof(sa));
+   sa.nl_family = AF_NETLINK;
+
+   sock = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
+   if (sock < 0) {
+   printf("open netlink socket: %s\n", strerror(errno));
+   return -1;
+   }
+
+   if (bind(sock, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
+   printf("bind to netlink: %s\n", strerror(errno));
+   goto cleanup;
+   }
+
	memset(&req, 0, sizeof(req));
+   req.nh.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg));
+   req.nh.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
+   req.nh.nlmsg_type = RTM_SETLINK;
+   req.nh.nlmsg_pid = 0;
+   req.nh.nlmsg_seq = ++seq;
+   req.ifinfo.ifi_family = AF_UNSPEC;
+   req.ifinfo.ifi_index = ifindex;
	nla = (struct nlattr *)(((char *)&req)
+   + NLMSG_ALIGN(req.nh.nlmsg_len));
+   nla->nla_type = NLA_F_NESTED | 43/*IFLA_XDP*/;
+
+   nla_xdp = (struct nlattr *)((char *)nla + NLA_HDRLEN);
+   nla_xdp->nla_type = 

[PATCH v3 net-next 3/4] mlx4: xdp: Reserve headroom for receiving packet when XDP prog is active

2016-12-06 Thread Martin KaFai Lau
Reserve XDP_PACKET_HEADROOM for the packet and enable bpf_xdp_adjust_head()
support.  This patch only affects the code path where XDP is active.

Testing shows that the tx_dropped counter is incremented if the xdp_prog
sends more than the wire MTU.

Signed-off-by: Martin KaFai Lau 
---
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c |  5 +++--
 drivers/net/ethernet/mellanox/mlx4/en_rx.c | 24 ++--
 drivers/net/ethernet/mellanox/mlx4/en_tx.c |  9 +
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |  3 ++-
 4 files changed, 28 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index 5482591688f8..36b9bb042778 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -51,7 +51,8 @@
 #include "mlx4_en.h"
 #include "en_port.h"
 
-#define MLX4_EN_MAX_XDP_MTU ((int)(PAGE_SIZE - ETH_HLEN - (2 * VLAN_HLEN)))
+#define MLX4_EN_MAX_XDP_MTU ((int)(PAGE_SIZE - ETH_HLEN - (2 * VLAN_HLEN) - \
+  XDP_PACKET_HEADROOM))
 
 int mlx4_en_setup_tc(struct net_device *dev, u8 up)
 {
@@ -2807,7 +2808,7 @@ static int mlx4_xdp(struct net_device *dev, struct netdev_xdp *xdp)
xdp->prog_attached = mlx4_xdp_attached(dev);
return 0;
case XDP_QUERY_FEATURES:
-   xdp->features = 0;
+   xdp->features = XDP_F_ADJUST_HEAD;
return 0;
default:
return -EINVAL;
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 23e9d04d1ef4..3c37e216bbf3 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -96,7 +96,6 @@ static int mlx4_en_alloc_frags(struct mlx4_en_priv *priv,
struct mlx4_en_rx_alloc page_alloc[MLX4_EN_MAX_RX_FRAGS];
const struct mlx4_en_frag_info *frag_info;
struct page *page;
-   dma_addr_t dma;
int i;
 
for (i = 0; i < priv->num_frags; i++) {
@@ -115,9 +114,10 @@ static int mlx4_en_alloc_frags(struct mlx4_en_priv *priv,
 
for (i = 0; i < priv->num_frags; i++) {
frags[i] = ring_alloc[i];
-   dma = ring_alloc[i].dma + ring_alloc[i].page_offset;
+   frags[i].page_offset += priv->frag_info[i].rx_headroom;
+   rx_desc->data[i].addr = cpu_to_be64(frags[i].dma +
+   frags[i].page_offset);
ring_alloc[i] = page_alloc[i];
-   rx_desc->data[i].addr = cpu_to_be64(dma);
}
 
return 0;
@@ -250,7 +250,8 @@ static int mlx4_en_prepare_rx_desc(struct mlx4_en_priv *priv,
 
if (ring->page_cache.index > 0) {
frags[0] = ring->page_cache.buf[--ring->page_cache.index];
-   rx_desc->data[0].addr = cpu_to_be64(frags[0].dma);
+   rx_desc->data[0].addr = cpu_to_be64(frags[0].dma +
+   frags[0].page_offset);
return 0;
}
 
@@ -889,6 +890,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int budget)
if (xdp_prog) {
struct xdp_buff xdp;
dma_addr_t dma;
+   void *orig_data;
u32 act;
 
dma = be64_to_cpu(rx_desc->data[0].addr);
@@ -896,11 +898,19 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int budget)
priv->frag_info[0].frag_size,
DMA_FROM_DEVICE);
 
-   xdp.data = page_address(frags[0].page) +
-   frags[0].page_offset;
+   xdp.data_hard_start = page_address(frags[0].page);
+   xdp.data = xdp.data_hard_start + frags[0].page_offset;
xdp.data_end = xdp.data + length;
+   orig_data = xdp.data;
 
		act = bpf_prog_run_xdp(xdp_prog, &xdp);
+
+   if (xdp.data != orig_data) {
+   length = xdp.data_end - xdp.data;
+   frags[0].page_offset = xdp.data -
+   xdp.data_hard_start;
+   }
+
switch (act) {
case XDP_PASS:
break;
@@ -1180,6 +1190,7 @@ void mlx4_en_calc_rx_buf(struct net_device *dev)
 */
priv->frag_info[0].frag_stride = PAGE_SIZE;
priv->frag_info[0].dma_dir = PCI_DMA_BIDIRECTIONAL;
+   priv->frag_info[0].rx_headroom = XDP_PACKET_HEADROOM;
i = 1;
} else {
int buf_size = 0;
@@ -1194,6 +1205,7 

[PATCH v3 net-next 0/4]: Allow head adjustment in XDP prog

2016-12-06 Thread Martin KaFai Lau
This series adds a helper to allow head adjustment in an XDP prog.  The
mlx4 driver has been modified to support this feature.  An example is
written to encapsulate a packet with an IPv4/v6 header and then XDP_TX it
out.

v3:
1. Check if the driver supports head adjustment before
   setting the xdp_prog fd to the device in patch 1of4.
2. Remove the page alignment assumption on the data_hard_start.
   Instead, add data_hard_start to the struct xdp_buff and the
   driver has to fill it if it supports head adjustment.
3. Keep the wire MTU as before in mlx4
4. Set map0_byte_count to PAGE_SIZE in patch 3of4

v2:
1. Make a variable name change in bpf_xdp_adjust_head() in patch 1
2. Ensure no less than ETH_HLEN data in bpf_xdp_adjust_head() in patch 1
3. Some clarifications in commit log messages of patch 2 and 3

Thanks,
Martin

Martin KaFai Lau (4):
  bpf: xdp: Allow head adjustment in XDP prog
  mlx4: xdp: Allow raising MTU up to one page minus eth and vlan hdrs
  mlx4: xdp: Reserve headroom for receiving packet when XDP prog is
active
  bpf: xdp: Add XDP example for head adjustment

 arch/powerpc/net/bpf_jit_comp64.c  |   4 +-
 arch/s390/net/bpf_jit_comp.c   |   2 +-
 arch/x86/net/bpf_jit_comp.c|   2 +-
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c |  32 ++-
 drivers/net/ethernet/mellanox/mlx4/en_rx.c |  70 +++---
 drivers/net/ethernet/mellanox/mlx4/en_tx.c |   9 +-
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |   3 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |   3 +
 .../net/ethernet/netronome/nfp/nfp_net_common.c|   3 +
 drivers/net/ethernet/qlogic/qede/qede_main.c   |   3 +
 include/linux/filter.h |   6 +-
 include/linux/netdevice.h  |  12 +
 include/uapi/linux/bpf.h   |  11 +-
 kernel/bpf/core.c  |   2 +-
 kernel/bpf/syscall.c   |   2 +
 kernel/bpf/verifier.c  |   2 +-
 net/core/dev.c |   9 +
 net/core/filter.c  |  28 ++-
 samples/bpf/Makefile   |   4 +
 samples/bpf/bpf_helpers.h  |   2 +
 samples/bpf/bpf_load.c |  94 
 samples/bpf/bpf_load.h |   1 +
 samples/bpf/xdp1_user.c|  93 
 samples/bpf/xdp_tx_iptnl_common.h  |  37 +++
 samples/bpf/xdp_tx_iptnl_kern.c| 232 +++
 samples/bpf/xdp_tx_iptnl_user.c| 253 +
 26 files changed, 774 insertions(+), 145 deletions(-)
 create mode 100644 samples/bpf/xdp_tx_iptnl_common.h
 create mode 100644 samples/bpf/xdp_tx_iptnl_kern.c
 create mode 100644 samples/bpf/xdp_tx_iptnl_user.c

-- 
2.5.1



[PATCH v3 net-next 2/4] mlx4: xdp: Allow raising MTU up to one page minus eth and vlan hdrs

2016-12-06 Thread Martin KaFai Lau
When XDP is active in mlx4, the driver uses one page per packet.
At the same time (i.e. when XDP is active), it currently limits the
MTU to FRAG_SZ0 - ETH_HLEN - (2 * VLAN_HLEN), which is 1514 on x86.
AFAICT, we can at least raise the MTU limit up to
PAGE_SIZE - ETH_HLEN - (2 * VLAN_HLEN), which is what this patch does.
It will be useful in the next patch, which allows an XDP program to
extend the packet by adding new header(s).

Note: the earlier XDP patches already added a guard to ensure the
page-per-packet scheme only applies when XDP is active in mlx4.

Signed-off-by: Martin KaFai Lau 
---
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 28 +++-
 drivers/net/ethernet/mellanox/mlx4/en_rx.c | 46 ++
 2 files changed, 44 insertions(+), 30 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index 6261157f444e..5482591688f8 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -51,6 +51,8 @@
 #include "mlx4_en.h"
 #include "en_port.h"
 
+#define MLX4_EN_MAX_XDP_MTU ((int)(PAGE_SIZE - ETH_HLEN - (2 * VLAN_HLEN)))
+
 int mlx4_en_setup_tc(struct net_device *dev, u8 up)
 {
struct mlx4_en_priv *priv = netdev_priv(dev);
@@ -2249,6 +2251,19 @@ void mlx4_en_destroy_netdev(struct net_device *dev)
free_netdev(dev);
 }
 
+static bool mlx4_en_check_xdp_mtu(struct net_device *dev, int mtu)
+{
+   struct mlx4_en_priv *priv = netdev_priv(dev);
+
+   if (mtu > MLX4_EN_MAX_XDP_MTU) {
+   en_err(priv, "mtu:%d > max:%d when XDP prog is attached\n",
+  mtu, MLX4_EN_MAX_XDP_MTU);
+   return false;
+   }
+
+   return true;
+}
+
 static int mlx4_en_change_mtu(struct net_device *dev, int new_mtu)
 {
struct mlx4_en_priv *priv = netdev_priv(dev);
@@ -2258,11 +2273,10 @@ static int mlx4_en_change_mtu(struct net_device *dev, int new_mtu)
en_dbg(DRV, priv, "Change MTU called - current:%d new:%d\n",
 dev->mtu, new_mtu);
 
-   if (priv->tx_ring_num[TX_XDP] && MLX4_EN_EFF_MTU(new_mtu) > FRAG_SZ0) {
-   en_err(priv, "MTU size:%d requires frags but XDP running\n",
-  new_mtu);
-   return -EOPNOTSUPP;
-   }
+   if (priv->tx_ring_num[TX_XDP] &&
+   !mlx4_en_check_xdp_mtu(dev, new_mtu))
+   return -ENOTSUPP;
+
dev->mtu = new_mtu;
 
if (netif_running(dev)) {
@@ -2710,10 +2724,8 @@ static int mlx4_xdp_set(struct net_device *dev, struct bpf_prog *prog)
return 0;
}
 
-   if (priv->num_frags > 1) {
-   en_err(priv, "Cannot set XDP if MTU requires multiple frags\n");
+   if (!mlx4_en_check_xdp_mtu(dev, dev->mtu))
return -EOPNOTSUPP;
-   }
 
tmp = kzalloc(sizeof(*tmp), GFP_KERNEL);
if (!tmp)
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 6562f78b07f4..23e9d04d1ef4 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -1164,37 +1164,39 @@ static const int frag_sizes[] = {
 
 void mlx4_en_calc_rx_buf(struct net_device *dev)
 {
-   enum dma_data_direction dma_dir = PCI_DMA_FROMDEVICE;
struct mlx4_en_priv *priv = netdev_priv(dev);
int eff_mtu = MLX4_EN_EFF_MTU(dev->mtu);
-   int order = MLX4_EN_ALLOC_PREFER_ORDER;
-   u32 align = SMP_CACHE_BYTES;
-   int buf_size = 0;
int i = 0;
 
/* bpf requires buffers to be set up as 1 packet per page.
 * This only works when num_frags == 1.
 */
if (priv->tx_ring_num[TX_XDP]) {
-   dma_dir = PCI_DMA_BIDIRECTIONAL;
-   /* This will gain efficient xdp frame recycling at the expense
-* of more costly truesize accounting
+   priv->frag_info[0].order = 0;
+   priv->frag_info[0].frag_size = eff_mtu;
+   priv->frag_info[0].frag_prefix_size = 0;
+   /* This will gain efficient xdp frame recycling at the
+* expense of more costly truesize accounting
 */
-   align = PAGE_SIZE;
-   order = 0;
-   }
-
-   while (buf_size < eff_mtu) {
-   priv->frag_info[i].order = order;
-   priv->frag_info[i].frag_size =
-   (eff_mtu > buf_size + frag_sizes[i]) ?
-   frag_sizes[i] : eff_mtu - buf_size;
-   priv->frag_info[i].frag_prefix_size = buf_size;
-   priv->frag_info[i].frag_stride =
-   ALIGN(priv->frag_info[i].frag_size, align);
-   priv->frag_info[i].dma_dir = dma_dir;
-   buf_size += priv->frag_info[i].frag_size;
-   i++;
+   priv->frag_info[0].frag_stride = PAGE_SIZE;
+

Re: [PATCH 10/10] virtio: enable endian checks for sparse builds

2016-12-06 Thread Jason Wang



On 2016年12月06日 23:41, Michael S. Tsirkin wrote:

__CHECK_ENDIAN__ isn't on by default, presumably because
it triggers too many sparse warnings for correct code.
But virtio is now clean of these warnings, and
we want to keep it this way - enable it for
sparse builds.

Signed-off-by: Michael S. Tsirkin 
---

It seems that there should be a better way to do it,
but this works too.


Reviewed-by: Jason Wang 



  drivers/block/Makefile  | 1 +
  drivers/char/Makefile   | 1 +
  drivers/char/hw_random/Makefile | 2 ++
  drivers/gpu/drm/virtio/Makefile | 1 +
  drivers/net/Makefile| 3 +++
  drivers/net/caif/Makefile   | 1 +
  drivers/rpmsg/Makefile  | 1 +
  drivers/s390/virtio/Makefile| 2 ++
  drivers/scsi/Makefile   | 1 +
  drivers/vhost/Makefile  | 1 +
  drivers/virtio/Makefile | 3 +++
  net/9p/Makefile | 1 +
  net/packet/Makefile | 1 +
  net/vmw_vsock/Makefile  | 2 ++
  14 files changed, 21 insertions(+)

diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index 1e9661e..597481c 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -27,6 +27,7 @@ obj-$(CONFIG_BLK_DEV_OSD) += osdblk.o
  obj-$(CONFIG_BLK_DEV_UMEM)+= umem.o
  obj-$(CONFIG_BLK_DEV_NBD) += nbd.o
  obj-$(CONFIG_BLK_DEV_CRYPTOLOOP) += cryptoloop.o
+CFLAGS_virtio_blk.o += -D__CHECK_ENDIAN__
  obj-$(CONFIG_VIRTIO_BLK)  += virtio_blk.o
  
  obj-$(CONFIG_BLK_DEV_SX8)	+= sx8.o

diff --git a/drivers/char/Makefile b/drivers/char/Makefile
index 6e6c244..a99467d 100644
--- a/drivers/char/Makefile
+++ b/drivers/char/Makefile
@@ -6,6 +6,7 @@ obj-y   += mem.o random.o
  obj-$(CONFIG_TTY_PRINTK)  += ttyprintk.o
  obj-y += misc.o
  obj-$(CONFIG_ATARI_DSP56K)+= dsp56k.o
+CFLAGS_virtio_console.o += -D__CHECK_ENDIAN__
  obj-$(CONFIG_VIRTIO_CONSOLE)  += virtio_console.o
  obj-$(CONFIG_RAW_DRIVER)  += raw.o
  obj-$(CONFIG_SGI_SNSC)+= snsc.o snsc_event.o
diff --git a/drivers/char/hw_random/Makefile b/drivers/char/hw_random/Makefile
index 5f52b1e..a2b0931 100644
--- a/drivers/char/hw_random/Makefile
+++ b/drivers/char/hw_random/Makefile
@@ -17,6 +17,8 @@ obj-$(CONFIG_HW_RANDOM_IXP4XX) += ixp4xx-rng.o
  obj-$(CONFIG_HW_RANDOM_OMAP) += omap-rng.o
  obj-$(CONFIG_HW_RANDOM_OMAP3_ROM) += omap3-rom-rng.o
  obj-$(CONFIG_HW_RANDOM_PASEMI) += pasemi-rng.o
+CFLAGS_virtio_transport.o += -D__CHECK_ENDIAN__
+CFLAGS_virtio-rng.o += -D__CHECK_ENDIAN__
  obj-$(CONFIG_HW_RANDOM_VIRTIO) += virtio-rng.o
  obj-$(CONFIG_HW_RANDOM_TX4939) += tx4939-rng.o
  obj-$(CONFIG_HW_RANDOM_MXC_RNGA) += mxc-rnga.o
diff --git a/drivers/gpu/drm/virtio/Makefile b/drivers/gpu/drm/virtio/Makefile
index 3fb8eac..1162366 100644
--- a/drivers/gpu/drm/virtio/Makefile
+++ b/drivers/gpu/drm/virtio/Makefile
@@ -3,6 +3,7 @@
  # Direct Rendering Infrastructure (DRI) in XFree86 4.1.0 and higher.
  
  ccflags-y := -Iinclude/drm

+ccflags-y += -D__CHECK_ENDIAN__
  
  virtio-gpu-y := virtgpu_drv.o virtgpu_kms.o virtgpu_drm_bus.o virtgpu_gem.o \

virtgpu_fb.o virtgpu_display.o virtgpu_vq.o virtgpu_ttm.o \
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index 7336cbd..3f587de 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -12,6 +12,7 @@ obj-$(CONFIG_EQUALIZER) += eql.o
  obj-$(CONFIG_IFB) += ifb.o
  obj-$(CONFIG_MACSEC) += macsec.o
  obj-$(CONFIG_MACVLAN) += macvlan.o
+CFLAGS_macvtap.o += -D__CHECK_ENDIAN__
  obj-$(CONFIG_MACVTAP) += macvtap.o
  obj-$(CONFIG_MII) += mii.o
  obj-$(CONFIG_MDIO) += mdio.o
@@ -20,8 +21,10 @@ obj-$(CONFIG_NETCONSOLE) += netconsole.o
  obj-$(CONFIG_PHYLIB) += phy/
  obj-$(CONFIG_RIONET) += rionet.o
  obj-$(CONFIG_NET_TEAM) += team/
+CFLAGS_tun.o += -D__CHECK_ENDIAN__
  obj-$(CONFIG_TUN) += tun.o
  obj-$(CONFIG_VETH) += veth.o
+CFLAGS_virtio_net.o += -D__CHECK_ENDIAN__
  obj-$(CONFIG_VIRTIO_NET) += virtio_net.o
  obj-$(CONFIG_VXLAN) += vxlan.o
  obj-$(CONFIG_GENEVE) += geneve.o
diff --git a/drivers/net/caif/Makefile b/drivers/net/caif/Makefile
index 9bbd453..d1a922c 100644
--- a/drivers/net/caif/Makefile
+++ b/drivers/net/caif/Makefile
@@ -12,3 +12,4 @@ obj-$(CONFIG_CAIF_HSI) += caif_hsi.o
  
  # Virtio interface

  obj-$(CONFIG_CAIF_VIRTIO) += caif_virtio.o
+CFLAGS_caif_virtio.o += -D__CHECK_ENDIAN__
diff --git a/drivers/rpmsg/Makefile b/drivers/rpmsg/Makefile
index ae9c913..23c8b66 100644
--- a/drivers/rpmsg/Makefile
+++ b/drivers/rpmsg/Makefile
@@ -1,3 +1,4 @@
  obj-$(CONFIG_RPMSG)   += rpmsg_core.o
  obj-$(CONFIG_RPMSG_QCOM_SMD)  += qcom_smd.o
  obj-$(CONFIG_RPMSG_VIRTIO)+= virtio_rpmsg_bus.o
+CFLAGS_virtio_rpmsg_bus.o  += -D__CHECK_ENDIAN__
diff --git a/drivers/s390/virtio/Makefile b/drivers/s390/virtio/Makefile
index df40692..270ada5 100644
--- a/drivers/s390/virtio/Makefile
+++ b/drivers/s390/virtio/Makefile
@@ -6,6 +6,8 @@
  # it under the terms of the GNU General Public License 

[PATCH net-next v2 4/7] bnxt_en: Improve completion ring allocation for VFs.

2016-12-06 Thread Michael Chan
All remaining completion rings not used by the PF should be made
available to the VFs so that there are enough rings in the VF to
support RDMA.  The earlier workaround code of capping the rings by the
statistics context is removed.

When SRIOV is disabled, call a new function bnxt_restore_pf_fw_resources()
to restore FW resources.  Later on, we will need to add some logic to
account for RDMA resources.

Signed-off-by: Somnath Kotur 
Signed-off-by: Michael Chan 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c   |  8 +++-
 drivers/net/ethernet/broadcom/bnxt/bnxt.h   |  2 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c | 14 --
 3 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 30b482b..52b8ad4 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -4152,7 +4152,7 @@ static int bnxt_hwrm_func_qcfg(struct bnxt *bp)
return rc;
 }
 
-int bnxt_hwrm_func_qcaps(struct bnxt *bp)
+static int bnxt_hwrm_func_qcaps(struct bnxt *bp)
 {
int rc = 0;
struct hwrm_func_qcaps_input req = {0};
@@ -6856,6 +6856,12 @@ static int bnxt_set_dflt_rings(struct bnxt *bp)
return rc;
 }
 
+void bnxt_restore_pf_fw_resources(struct bnxt *bp)
+{
+   ASSERT_RTNL();
+   bnxt_hwrm_func_qcaps(bp);
+}
+
 static void bnxt_parse_log_pcie_link(struct bnxt *bp)
 {
enum pcie_link_width width = PCIE_LNK_WIDTH_UNKNOWN;
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
index 0ee2cc4..43a4b17 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
@@ -1234,7 +1234,6 @@ static inline void bnxt_disable_poll(struct bnxt_napi *bnapi)
 int hwrm_send_message(struct bnxt *, void *, u32, int);
 int hwrm_send_message_silent(struct bnxt *, void *, u32, int);
 int bnxt_hwrm_set_coal(struct bnxt *);
-int bnxt_hwrm_func_qcaps(struct bnxt *);
 void bnxt_set_max_func_irqs(struct bnxt *bp, unsigned int max);
 void bnxt_tx_disable(struct bnxt *bp);
 void bnxt_tx_enable(struct bnxt *bp);
@@ -1245,4 +1244,5 @@ static inline void bnxt_disable_poll(struct bnxt_napi *bnapi)
 int bnxt_close_nic(struct bnxt *, bool, bool);
 int bnxt_setup_mq_tc(struct net_device *dev, u8 tc);
 int bnxt_get_max_rings(struct bnxt *, int *, int *, bool);
+void bnxt_restore_pf_fw_resources(struct bnxt *bp);
 #endif
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c
index bff626a..c696025 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c
@@ -420,15 +420,7 @@ static int bnxt_hwrm_func_cfg(struct bnxt *bp, int num_vfs)
	bnxt_hwrm_cmd_hdr_init(bp, &req, HWRM_FUNC_CFG, -1, -1);
 
/* Remaining rings are distributed equally amongs VF's for now */
-   /* TODO: the following workaroud is needed to restrict total number
-* of vf_cp_rings not exceed number of HW ring groups. This WA should
-* be removed once new HWRM provides HW ring groups capability in
-* hwrm_func_qcap.
-*/
-   vf_cp_rings = min_t(u16, pf->max_cp_rings, pf->max_stat_ctxs);
-   vf_cp_rings = (vf_cp_rings - bp->cp_nr_rings) / num_vfs;
-   /* TODO: restore this logic below once the WA above is removed */
-   /* vf_cp_rings = (pf->max_cp_rings - bp->cp_nr_rings) / num_vfs; */
+   vf_cp_rings = (pf->max_cp_rings - bp->cp_nr_rings) / num_vfs;
vf_stat_ctx = (pf->max_stat_ctxs - bp->num_stat_ctxs) / num_vfs;
if (bp->flags & BNXT_FLAG_AGG_RINGS)
vf_rx_rings = (pf->max_rx_rings - bp->rx_nr_rings * 2) /
@@ -590,7 +582,9 @@ void bnxt_sriov_disable(struct bnxt *bp)
 
bp->pf.active_vfs = 0;
/* Reclaim all resources for the PF. */
-   bnxt_hwrm_func_qcaps(bp);
+   rtnl_lock();
+   bnxt_restore_pf_fw_resources(bp);
+   rtnl_unlock();
 }
 
 int bnxt_sriov_configure(struct pci_dev *pdev, int num_vfs)
-- 
1.8.3.1



[PATCH net-next v2 6/7] bnxt_en: Refactor the driver registration function with firmware.

2016-12-06 Thread Michael Chan
The driver registration function with firmware consists of passing
version information and registering for async events.  To support the
RDMA driver, the async events that we need to register may change.
Separate the driver registration function into two parts so that we can
just update the async events for the RDMA driver.

Signed-off-by: Michael Chan 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 34 ++-
 drivers/net/ethernet/broadcom/bnxt/bnxt.h |  2 ++
 2 files changed, 31 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 57285bd..c782942 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -3117,27 +3117,46 @@ int hwrm_send_message_silent(struct bnxt *bp, void *msg, u32 msg_len,
return rc;
 }
 
-static int bnxt_hwrm_func_drv_rgtr(struct bnxt *bp)
+int bnxt_hwrm_func_rgtr_async_events(struct bnxt *bp, unsigned long *bmap,
+int bmap_size)
 {
struct hwrm_func_drv_rgtr_input req = {0};
-   int i;
DECLARE_BITMAP(async_events_bmap, 256);
u32 *events = (u32 *)async_events_bmap;
+   int i;
 
	bnxt_hwrm_cmd_hdr_init(bp, &req, HWRM_FUNC_DRV_RGTR, -1, -1);
 
req.enables =
-   cpu_to_le32(FUNC_DRV_RGTR_REQ_ENABLES_OS_TYPE |
-   FUNC_DRV_RGTR_REQ_ENABLES_VER |
-   FUNC_DRV_RGTR_REQ_ENABLES_ASYNC_EVENT_FWD);
+   cpu_to_le32(FUNC_DRV_RGTR_REQ_ENABLES_ASYNC_EVENT_FWD);
 
memset(async_events_bmap, 0, sizeof(async_events_bmap));
for (i = 0; i < ARRAY_SIZE(bnxt_async_events_arr); i++)
__set_bit(bnxt_async_events_arr[i], async_events_bmap);
 
+   if (bmap && bmap_size) {
+   for (i = 0; i < bmap_size; i++) {
+   if (test_bit(i, bmap))
+   __set_bit(i, async_events_bmap);
+   }
+   }
+
for (i = 0; i < 8; i++)
req.async_event_fwd[i] |= cpu_to_le32(events[i]);
 
+   return hwrm_send_message(bp, &req, sizeof(req), HWRM_CMD_TIMEOUT);
+}
+
+static int bnxt_hwrm_func_drv_rgtr(struct bnxt *bp)
+{
+   struct hwrm_func_drv_rgtr_input req = {0};
+
+   bnxt_hwrm_cmd_hdr_init(bp, &req, HWRM_FUNC_DRV_RGTR, -1, -1);
+
+   req.enables =
+   cpu_to_le32(FUNC_DRV_RGTR_REQ_ENABLES_OS_TYPE |
+   FUNC_DRV_RGTR_REQ_ENABLES_VER);
+
req.os_type = cpu_to_le16(FUNC_DRV_RGTR_REQ_OS_TYPE_LINUX);
req.ver_maj = DRV_VER_MAJ;
req.ver_min = DRV_VER_MIN;
@@ -3146,6 +3165,7 @@ static int bnxt_hwrm_func_drv_rgtr(struct bnxt *bp)
if (BNXT_PF(bp)) {
DECLARE_BITMAP(vf_req_snif_bmap, 256);
u32 *data = (u32 *)vf_req_snif_bmap;
+   int i;
 
memset(vf_req_snif_bmap, 0, sizeof(vf_req_snif_bmap));
for (i = 0; i < ARRAY_SIZE(bnxt_vf_req_snif); i++)
@@ -7023,6 +7043,10 @@ static int bnxt_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
if (rc)
goto init_err;
 
+   rc = bnxt_hwrm_func_rgtr_async_events(bp, NULL, 0);
+   if (rc)
+   goto init_err;
+
/* Get the MAX capabilities for this function */
rc = bnxt_hwrm_func_qcaps(bp);
if (rc) {
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
index d796836..eec2415 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
@@ -1240,6 +1240,8 @@ static inline void bnxt_disable_poll(struct bnxt_napi *bnapi)
 int _hwrm_send_message(struct bnxt *, void *, u32, int);
 int hwrm_send_message(struct bnxt *, void *, u32, int);
 int hwrm_send_message_silent(struct bnxt *, void *, u32, int);
+int bnxt_hwrm_func_rgtr_async_events(struct bnxt *bp, unsigned long *bmap,
+int bmap_size);
 int bnxt_hwrm_set_coal(struct bnxt *);
 unsigned int bnxt_get_max_func_stat_ctxs(struct bnxt *bp);
 unsigned int bnxt_get_max_func_cp_rings(struct bnxt *bp);
-- 
1.8.3.1



[PATCH net-next v2 5/7] bnxt_en: Reserve RDMA resources by default.

2016-12-06 Thread Michael Chan
If the device supports RDMA, we'll set up the network default rings so
that there are enough minimum resources for RDMA, if possible.  However,
the user can still increase the network rings to the max if desired.  The
actual RDMA resources won't be reserved until the RDMA driver registers.

v2: Fix compile warning when BNXT_CONFIG_SRIOV is not set.

Signed-off-by: Somnath Kotur 
Signed-off-by: Michael Chan 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 58 ++-
 drivers/net/ethernet/broadcom/bnxt/bnxt.h |  9 +
 2 files changed, 66 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 52b8ad4..57285bd 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -4166,6 +4166,11 @@ static int bnxt_hwrm_func_qcaps(struct bnxt *bp)
if (rc)
goto hwrm_func_qcaps_exit;
 
+   if (resp->flags & cpu_to_le32(FUNC_QCAPS_RESP_FLAGS_ROCE_V1_SUPPORTED))
+   bp->flags |= BNXT_FLAG_ROCEV1_CAP;
+   if (resp->flags & cpu_to_le32(FUNC_QCAPS_RESP_FLAGS_ROCE_V2_SUPPORTED))
+   bp->flags |= BNXT_FLAG_ROCEV2_CAP;
+
bp->tx_push_thresh = 0;
if (resp->flags &
cpu_to_le32(FUNC_QCAPS_RESP_FLAGS_PUSH_MODE_SUPPORTED))
@@ -4808,6 +4813,24 @@ static int bnxt_setup_int_mode(struct bnxt *bp)
return rc;
 }
 
+unsigned int bnxt_get_max_func_stat_ctxs(struct bnxt *bp)
+{
+#if defined(CONFIG_BNXT_SRIOV)
+   if (BNXT_VF(bp))
+   return bp->vf.max_stat_ctxs;
+#endif
+   return bp->pf.max_stat_ctxs;
+}
+
+unsigned int bnxt_get_max_func_cp_rings(struct bnxt *bp)
+{
+#if defined(CONFIG_BNXT_SRIOV)
+   if (BNXT_VF(bp))
+   return bp->vf.max_cp_rings;
+#endif
+   return bp->pf.max_cp_rings;
+}
+
 static unsigned int bnxt_get_max_func_irqs(struct bnxt *bp)
 {
 #if defined(CONFIG_BNXT_SRIOV)
@@ -6832,6 +6855,39 @@ int bnxt_get_max_rings(struct bnxt *bp, int *max_rx, int *max_tx, bool shared)
return bnxt_trim_rings(bp, max_rx, max_tx, cp, shared);
 }
 
+static int bnxt_get_dflt_rings(struct bnxt *bp, int *max_rx, int *max_tx,
+  bool shared)
+{
+   int rc;
+
+   rc = bnxt_get_max_rings(bp, max_rx, max_tx, shared);
+   if (rc)
+   return rc;
+
+   if (bp->flags & BNXT_FLAG_ROCE_CAP) {
+   int max_cp, max_stat, max_irq;
+
+   /* Reserve minimum resources for RoCE */
+   max_cp = bnxt_get_max_func_cp_rings(bp);
+   max_stat = bnxt_get_max_func_stat_ctxs(bp);
+   max_irq = bnxt_get_max_func_irqs(bp);
+   if (max_cp <= BNXT_MIN_ROCE_CP_RINGS ||
+   max_irq <= BNXT_MIN_ROCE_CP_RINGS ||
+   max_stat <= BNXT_MIN_ROCE_STAT_CTXS)
+   return 0;
+
+   max_cp -= BNXT_MIN_ROCE_CP_RINGS;
+   max_irq -= BNXT_MIN_ROCE_CP_RINGS;
+   max_stat -= BNXT_MIN_ROCE_STAT_CTXS;
+   max_cp = min_t(int, max_cp, max_irq);
+   max_cp = min_t(int, max_cp, max_stat);
+   rc = bnxt_trim_rings(bp, max_rx, max_tx, max_cp, shared);
+   if (rc)
+   rc = 0;
+   }
+   return rc;
+}
+
 static int bnxt_set_dflt_rings(struct bnxt *bp)
 {
int dflt_rings, max_rx_rings, max_tx_rings, rc;
@@ -6840,7 +6896,7 @@ static int bnxt_set_dflt_rings(struct bnxt *bp)
if (sh)
bp->flags |= BNXT_FLAG_SHARED_RINGS;
dflt_rings = netif_get_num_default_rss_queues();
-   rc = bnxt_get_max_rings(bp, &max_rx_rings, &max_tx_rings, sh);
+   rc = bnxt_get_dflt_rings(bp, &max_rx_rings, &max_tx_rings, sh);
if (rc)
return rc;
bp->rx_nr_rings = min_t(int, dflt_rings, max_rx_rings);
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
index 43a4b17..d796836 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
@@ -387,6 +387,9 @@ struct rx_tpa_end_cmp_ext {
 #define DB_KEY_TX_PUSH (0x4 << 28)
 #define DB_LONG_TX_PUSH	(0x2 << 24)
 
+#define BNXT_MIN_ROCE_CP_RINGS 2
+#define BNXT_MIN_ROCE_STAT_CTXS1
+
 #define INVALID_HW_RING_ID ((u16)-1)
 
 /* The hardware supports certain page sizes.  Use the supported page sizes
@@ -953,6 +956,10 @@ struct bnxt {
#define BNXT_FLAG_PORT_STATS0x400
#define BNXT_FLAG_UDP_RSS_CAP   0x800
#define BNXT_FLAG_EEE_CAP   0x1000
+   #define BNXT_FLAG_ROCEV1_CAP0x8000
+   #define BNXT_FLAG_ROCEV2_CAP0x1
+   #define BNXT_FLAG_ROCE_CAP  (BNXT_FLAG_ROCEV1_CAP | \
+BNXT_FLAG_ROCEV2_CAP)
#define BNXT_FLAG_CHIP_NITRO_A0 

[PATCH net-next v2 7/7] bnxt_en: Add interface to support RDMA driver.

2016-12-06 Thread Michael Chan
Since the network driver and RDMA driver operate on the same PCI function,
we need to create an interface to allow the RDMA driver to share resources
with the network driver.

1. Create a new bnxt_en_dev struct which will be returned by
bnxt_ulp_probe() upon success.  After that, all calls from the RDMA driver
to bnxt_en will pass a pointer to this struct.

2. This struct contains additional function pointers to register, request
MSIX, send FW messages, and register for async events.

3. If the RDMA driver wants to enable RDMA on the function, it needs to
call the function pointer bnxt_register_device().  A ulp_ops structure
is passed for RCU-protected upcalls from bnxt_en to the RDMA driver.

4. The RDMA driver can call firmware APIs using the bnxt_send_fw_msg()
function pointer.

5. One stats context is reserved when the RDMA driver registers.  MSIX
and completion rings are reserved when the RDMA driver calls the
bnxt_request_msix() function pointer.

6. When the RDMA driver calls bnxt_unregister_device(), all RDMA resources
will be cleaned up.

v2: Fixed 2 uninitialized variable warnings.

Signed-off-by: Somnath Kotur 
Signed-off-by: Michael Chan 
---
 drivers/net/ethernet/broadcom/bnxt/Makefile   |   2 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt.c |  41 ++-
 drivers/net/ethernet/broadcom/bnxt/bnxt.h |   6 +
 drivers/net/ethernet/broadcom/bnxt/bnxt_ulp.c | 346 ++
 drivers/net/ethernet/broadcom/bnxt/bnxt_ulp.h |  93 +++
 5 files changed, 483 insertions(+), 5 deletions(-)
 create mode 100644 drivers/net/ethernet/broadcom/bnxt/bnxt_ulp.c
 create mode 100644 drivers/net/ethernet/broadcom/bnxt/bnxt_ulp.h

diff --git a/drivers/net/ethernet/broadcom/bnxt/Makefile b/drivers/net/ethernet/broadcom/bnxt/Makefile
index b233a86..6082ed1 100644
--- a/drivers/net/ethernet/broadcom/bnxt/Makefile
+++ b/drivers/net/ethernet/broadcom/bnxt/Makefile
@@ -1,3 +1,3 @@
 obj-$(CONFIG_BNXT) += bnxt_en.o
 
-bnxt_en-y := bnxt.o bnxt_sriov.o bnxt_ethtool.o bnxt_dcb.o
+bnxt_en-y := bnxt.o bnxt_sriov.o bnxt_ethtool.o bnxt_dcb.o bnxt_ulp.o
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index c782942..9608cb4 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -52,6 +52,7 @@
 
 #include "bnxt_hsi.h"
 #include "bnxt.h"
+#include "bnxt_ulp.h"
 #include "bnxt_sriov.h"
 #include "bnxt_ethtool.h"
 #include "bnxt_dcb.h"
@@ -1528,12 +1529,11 @@ static int bnxt_async_event_process(struct bnxt *bp,
		set_bit(BNXT_RESET_TASK_SILENT_SP_EVENT, &bp->sp_event);
break;
default:
-   netdev_err(bp->dev, "unhandled ASYNC event (id 0x%x)\n",
-  event_id);
goto async_event_process_exit;
}
	schedule_work(&bp->sp_task);
 async_event_process_exit:
+   bnxt_ulp_async_events(bp, cmpl);
return 0;
 }
 
@@ -3547,7 +3547,7 @@ static int bnxt_hwrm_vnic_ctx_alloc(struct bnxt *bp, u16 vnic_id, u16 ctx_idx)
return rc;
 }
 
-static int bnxt_hwrm_vnic_cfg(struct bnxt *bp, u16 vnic_id)
+int bnxt_hwrm_vnic_cfg(struct bnxt *bp, u16 vnic_id)
 {
unsigned int ring = 0, grp_idx;
	struct bnxt_vnic_info *vnic = &bp->vnic_info[vnic_id];
@@ -3595,6 +3595,9 @@ static int bnxt_hwrm_vnic_cfg(struct bnxt *bp, u16 vnic_id)
 #endif
if ((bp->flags & BNXT_FLAG_STRIP_VLAN) || def_vlan)
req.flags |= cpu_to_le32(VNIC_CFG_REQ_FLAGS_VLAN_STRIP_MODE);
+   if (!vnic_id && bnxt_ulp_registered(bp->edev, BNXT_ROCE_ULP))
+   req.flags |=
+   cpu_to_le32(VNIC_CFG_REQ_FLAGS_ROCE_DUAL_VNIC_MODE);
 
	return hwrm_send_message(bp, &req, sizeof(req), HWRM_CMD_TIMEOUT);
 }
@@ -4842,6 +4845,16 @@ unsigned int bnxt_get_max_func_stat_ctxs(struct bnxt *bp)
return bp->pf.max_stat_ctxs;
 }
 
+void bnxt_set_max_func_stat_ctxs(struct bnxt *bp, unsigned int max)
+{
+#if defined(CONFIG_BNXT_SRIOV)
+   if (BNXT_VF(bp))
+   bp->vf.max_stat_ctxs = max;
+   else
+#endif
+   bp->pf.max_stat_ctxs = max;
+}
+
 unsigned int bnxt_get_max_func_cp_rings(struct bnxt *bp)
 {
 #if defined(CONFIG_BNXT_SRIOV)
@@ -4851,6 +4864,16 @@ unsigned int bnxt_get_max_func_cp_rings(struct bnxt *bp)
return bp->pf.max_cp_rings;
 }
 
+void bnxt_set_max_func_cp_rings(struct bnxt *bp, unsigned int max)
+{
+#if defined(CONFIG_BNXT_SRIOV)
+   if (BNXT_VF(bp))
+   bp->vf.max_cp_rings = max;
+   else
+#endif
+   bp->pf.max_cp_rings = max;
+}
+
 static unsigned int bnxt_get_max_func_irqs(struct bnxt *bp)
 {
 #if defined(CONFIG_BNXT_SRIOV)
@@ -6767,6 +6790,8 @@ static void bnxt_remove_one(struct pci_dev *pdev)
pci_iounmap(pdev, bp->bar2);
pci_iounmap(pdev, bp->bar1);
pci_iounmap(pdev, bp->bar0);
+   kfree(bp->edev);
+   bp->edev = NULL;
free_netdev(dev);
 
  

[PATCH net-next v2 2/7] bnxt_en: Enable MSIX early in bnxt_init_one().

2016-12-06 Thread Michael Chan
To better support the new RDMA driver, we need to move pci_enable_msix()
from bnxt_open() to bnxt_init_one().  This way, MSIX vectors are available
to the RDMA driver whether the network device is up or down.

Part of the existing bnxt_setup_int_mode() function is now refactored into
a new bnxt_init_int_mode().  bnxt_init_int_mode() is called during
bnxt_init_one() to enable MSIX.  The remaining logic in
bnxt_setup_int_mode() to map the IRQs to the completion rings is called
during bnxt_open().

v2: Fixed compile warning when CONFIG_BNXT_SRIOV is not set.

Signed-off-by: Somnath Kotur 
Signed-off-by: Michael Chan 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 183 +++---
 drivers/net/ethernet/broadcom/bnxt/bnxt.h |   1 +
 2 files changed, 115 insertions(+), 69 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 269e757..da302eb 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -4743,6 +4743,80 @@ static int bnxt_trim_rings(struct bnxt *bp, int *rx, int 
*tx, int max,
return 0;
 }
 
+static void bnxt_setup_msix(struct bnxt *bp)
+{
+   const int len = sizeof(bp->irq_tbl[0].name);
+   struct net_device *dev = bp->dev;
+   int tcs, i;
+
+   tcs = netdev_get_num_tc(dev);
+   if (tcs > 1) {
+   bp->tx_nr_rings_per_tc = bp->tx_nr_rings / tcs;
+   if (bp->tx_nr_rings_per_tc == 0) {
+   netdev_reset_tc(dev);
+   bp->tx_nr_rings_per_tc = bp->tx_nr_rings;
+   } else {
+   int i, off, count;
+
+   bp->tx_nr_rings = bp->tx_nr_rings_per_tc * tcs;
+   for (i = 0; i < tcs; i++) {
+   count = bp->tx_nr_rings_per_tc;
+   off = i * count;
+   netdev_set_tc_queue(dev, i, count, off);
+   }
+   }
+   }
+
+   for (i = 0; i < bp->cp_nr_rings; i++) {
+   char *attr;
+
+   if (bp->flags & BNXT_FLAG_SHARED_RINGS)
+   attr = "TxRx";
+   else if (i < bp->rx_nr_rings)
+   attr = "rx";
+   else
+   attr = "tx";
+
+   snprintf(bp->irq_tbl[i].name, len, "%s-%s-%d", dev->name, attr,
+i);
+   bp->irq_tbl[i].handler = bnxt_msix;
+   }
+}
+
+static void bnxt_setup_inta(struct bnxt *bp)
+{
+   const int len = sizeof(bp->irq_tbl[0].name);
+
+   if (netdev_get_num_tc(bp->dev))
+   netdev_reset_tc(bp->dev);
+
+   snprintf(bp->irq_tbl[0].name, len, "%s-%s-%d", bp->dev->name, "TxRx",
+0);
+   bp->irq_tbl[0].handler = bnxt_inta;
+}
+
+static int bnxt_setup_int_mode(struct bnxt *bp)
+{
+   int rc;
+
+   if (bp->flags & BNXT_FLAG_USING_MSIX)
+   bnxt_setup_msix(bp);
+   else
+   bnxt_setup_inta(bp);
+
+   rc = bnxt_set_real_num_queues(bp);
+   return rc;
+}
+
+static unsigned int bnxt_get_max_func_irqs(struct bnxt *bp)
+{
+#if defined(CONFIG_BNXT_SRIOV)
+   if (BNXT_VF(bp))
+   return bp->vf.max_irqs;
+#endif
+   return bp->pf.max_irqs;
+}
+
 void bnxt_set_max_func_irqs(struct bnxt *bp, unsigned int max_irqs)
 {
 #if defined(CONFIG_BNXT_SRIOV)
@@ -4753,16 +4827,12 @@ void bnxt_set_max_func_irqs(struct bnxt *bp, unsigned 
int max_irqs)
bp->pf.max_irqs = max_irqs;
 }
 
-static int bnxt_setup_msix(struct bnxt *bp)
+static int bnxt_init_msix(struct bnxt *bp)
 {
-   struct msix_entry *msix_ent;
-   struct net_device *dev = bp->dev;
int i, total_vecs, rc = 0, min = 1;
-   const int len = sizeof(bp->irq_tbl[0].name);
-
-   bp->flags &= ~BNXT_FLAG_USING_MSIX;
-   total_vecs = bp->cp_nr_rings;
+   struct msix_entry *msix_ent;
 
+   total_vecs = bnxt_get_max_func_irqs(bp);
msix_ent = kcalloc(total_vecs, sizeof(struct msix_entry), GFP_KERNEL);
if (!msix_ent)
return -ENOMEM;
@@ -4783,8 +4853,10 @@ static int bnxt_setup_msix(struct bnxt *bp)
 
bp->irq_tbl = kcalloc(total_vecs, sizeof(struct bnxt_irq), GFP_KERNEL);
if (bp->irq_tbl) {
-   int tcs;
+   for (i = 0; i < total_vecs; i++)
+   bp->irq_tbl[i].vector = msix_ent[i].vector;
 
+   bp->total_irqs = total_vecs;
/* Trim rings based upon num of vectors allocated */
		rc = bnxt_trim_rings(bp, &bp->rx_nr_rings, &bp->tx_nr_rings,
 total_vecs, min == 1);
@@ -4792,43 +4864,10 @@ static int bnxt_setup_msix(struct bnxt *bp)
goto msix_setup_exit;
 
bp->tx_nr_rings_per_tc = bp->tx_nr_rings;
-   tcs = 

[PATCH net-next v2 3/7] bnxt_en: Move function reset to bnxt_init_one().

2016-12-06 Thread Michael Chan
Now that MSIX is enabled in bnxt_init_one(), resources may be allocated by
the RDMA driver before the network device is opened.  So we cannot do the
function reset in bnxt_open(), which would clear all those resources.

The proper place to do function reset now is in bnxt_init_one().
If we get AER, we'll do function reset as well.

Signed-off-by: Somnath Kotur 
Signed-off-by: Michael Chan 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 25 ++---
 drivers/net/ethernet/broadcom/bnxt/bnxt.h |  1 -
 2 files changed, 6 insertions(+), 20 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index da302eb..30b482b 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -5613,22 +5613,7 @@ int bnxt_open_nic(struct bnxt *bp, bool irq_re_init, 
bool link_re_init)
 static int bnxt_open(struct net_device *dev)
 {
struct bnxt *bp = netdev_priv(dev);
-   int rc = 0;
 
-   if (!test_bit(BNXT_STATE_FN_RST_DONE, &bp->state)) {
-   rc = bnxt_hwrm_func_reset(bp);
-   if (rc) {
-   netdev_err(bp->dev, "hwrm chip reset failure rc: %x\n",
-  rc);
-   rc = -EBUSY;
-   return rc;
-   }
-   /* Do func_reset during the 1st PF open only to prevent killing
-* the VFs when the PF is brought down and up.
-*/
-   if (BNXT_PF(bp))
-   set_bit(BNXT_STATE_FN_RST_DONE, &bp->state);
-   }
return __bnxt_open_nic(bp, true, true);
 }
 
@@ -7028,6 +7013,10 @@ static int bnxt_init_one(struct pci_dev *pdev, const 
struct pci_device_id *ent)
if (rc)
goto init_err;
 
+   rc = bnxt_hwrm_func_reset(bp);
+   if (rc)
+   goto init_err;
+
rc = bnxt_init_int_mode(bp);
if (rc)
goto init_err;
@@ -7069,7 +7058,6 @@ static pci_ers_result_t bnxt_io_error_detected(struct 
pci_dev *pdev,
   pci_channel_state_t state)
 {
struct net_device *netdev = pci_get_drvdata(pdev);
-   struct bnxt *bp = netdev_priv(netdev);
 
netdev_info(netdev, "PCI I/O error detected\n");
 
@@ -7084,8 +7072,6 @@ static pci_ers_result_t bnxt_io_error_detected(struct 
pci_dev *pdev,
if (netif_running(netdev))
bnxt_close(netdev);
 
-   /* So that func_reset will be done during slot_reset */
-   clear_bit(BNXT_STATE_FN_RST_DONE, &bp->state);
pci_disable_device(pdev);
rtnl_unlock();
 
@@ -7119,7 +7105,8 @@ static pci_ers_result_t bnxt_io_slot_reset(struct pci_dev 
*pdev)
} else {
pci_set_master(pdev);
 
-   if (netif_running(netdev))
+   err = bnxt_hwrm_func_reset(bp);
+   if (!err && netif_running(netdev))
err = bnxt_open(netdev);
 
if (!err)
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
index 1461355..0ee2cc4 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
@@ -1021,7 +1021,6 @@ struct bnxt {
unsigned long   state;
 #define BNXT_STATE_OPEN0
 #define BNXT_STATE_IN_SP_TASK  1
-#define BNXT_STATE_FN_RST_DONE 2
 
struct bnxt_irq *irq_tbl;
int total_irqs;
-- 
1.8.3.1



[PATCH net-next v2 1/7] bnxt_en: Add bnxt_set_max_func_irqs().

2016-12-06 Thread Michael Chan
Refactor existing code into this new function.  The new function
will be used in subsequent patches.

v2: Fixed compile warning when CONFIG_BNXT_SRIOV is not set.

Signed-off-by: Michael Chan 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 17 +++--
 drivers/net/ethernet/broadcom/bnxt/bnxt.h |  1 +
 2 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index e84613a..269e757 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -4743,6 +4743,16 @@ static int bnxt_trim_rings(struct bnxt *bp, int *rx, int 
*tx, int max,
return 0;
 }
 
+void bnxt_set_max_func_irqs(struct bnxt *bp, unsigned int max_irqs)
+{
+#if defined(CONFIG_BNXT_SRIOV)
+   if (BNXT_VF(bp))
+   bp->vf.max_irqs = max_irqs;
+   else
+#endif
+   bp->pf.max_irqs = max_irqs;
+}
+
 static int bnxt_setup_msix(struct bnxt *bp)
 {
struct msix_entry *msix_ent;
@@ -6949,12 +6959,7 @@ static int bnxt_init_one(struct pci_dev *pdev, const 
struct pci_device_id *ent)
 
bnxt_set_tpa_flags(bp);
bnxt_set_ring_params(bp);
-   if (BNXT_PF(bp))
-   bp->pf.max_irqs = max_irqs;
-#if defined(CONFIG_BNXT_SRIOV)
-   else
-   bp->vf.max_irqs = max_irqs;
-#endif
+   bnxt_set_max_func_irqs(bp, max_irqs);
bnxt_set_dflt_rings(bp);
 
/* Default RSS hash cfg. */
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
index b4abc1b..8327d0d 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
@@ -1235,6 +1235,7 @@ static inline void bnxt_disable_poll(struct bnxt_napi 
*bnapi)
 int hwrm_send_message_silent(struct bnxt *, void *, u32, int);
 int bnxt_hwrm_set_coal(struct bnxt *);
 int bnxt_hwrm_func_qcaps(struct bnxt *);
+void bnxt_set_max_func_irqs(struct bnxt *bp, unsigned int max);
 void bnxt_tx_disable(struct bnxt *bp);
 void bnxt_tx_enable(struct bnxt *bp);
 int bnxt_hwrm_set_pause(struct bnxt *);
-- 
1.8.3.1



[PATCH net-next v2 0/7] bnxt_en: Add interface to support RDMA driver.

2016-12-06 Thread Michael Chan
This series adds an interface to support a brand new RDMA driver bnxt_re.
The first step is to re-arrange some code so that pci_enable_msix() can
be called during pci probe.  The purpose is to allow the RDMA driver to
initialize and stay initialized whether the netdev is up or down.

Then we make some changes to VF resource allocation so that there is
enough resources to support RDMA.

Finally the last patch adds a simple interface to allow the RDMA driver to
probe and register itself with any bnxt_en devices that support RDMA.
Once registered, the RDMA driver can request MSIX, send fw messages, and
receive some notifications.

v2: Fixed kbuild test robot warnings.

David, please consider this series for net-next.  Thanks.

Michael Chan (7):
  bnxt_en: Add bnxt_set_max_func_irqs().
  bnxt_en: Enable MSIX early in bnxt_init_one().
  bnxt_en: Move function reset to bnxt_init_one().
  bnxt_en: Improve completion ring allocation for VFs.
  bnxt_en: Reserve RDMA resources by default.
  bnxt_en: Refactor the driver registration function with firmware.
  bnxt_en: Add interface to support RDMA driver.

 drivers/net/ethernet/broadcom/bnxt/Makefile |   2 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt.c   | 360 +---
 drivers/net/ethernet/broadcom/bnxt/bnxt.h   |  22 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c |  14 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt_ulp.c   | 346 +++
 drivers/net/ethernet/broadcom/bnxt/bnxt_ulp.h   |  93 ++
 6 files changed, 722 insertions(+), 115 deletions(-)
 create mode 100644 drivers/net/ethernet/broadcom/bnxt/bnxt_ulp.c
 create mode 100644 drivers/net/ethernet/broadcom/bnxt/bnxt_ulp.h

-- 
1.8.3.1



Re: [PATCH] net: wireless: realtek: constify rate_control_ops structures

2016-12-06 Thread Jes Sorensen
Larry Finger  writes:
> On 12/02/2016 03:50 AM, Bhumika Goyal wrote:
>> The structures rate_control_ops are only passed as an argument to the
>> functions ieee80211_rate_control_{register/unregister}. This argument is
>> of type const, so rate_control_ops having this property can also be
>> declared as const.
>> Done using Coccinelle:
>>
>> @r1 disable optional_qualifier @
>> identifier i;
>> position p;
>> @@
>> static struct rate_control_ops i@p = {...};
>>
>> @ok1@
>> identifier r1.i;
>> position p;
>> @@
>> ieee80211_rate_control_register(@p)
>>
>> @ok2@
>> identifier r1.i;
>> position p;
>> @@
>> ieee80211_rate_control_unregister(@p)
>>
>> @bad@
>> position p!={r1.p,ok1.p,ok2.p};
>> identifier r1.i;
>> @@
>> i@p
>>
>> @depends on !bad disable optional_qualifier@
>> identifier r1.i;
>> @@
>> static
>> +const
>> struct rate_control_ops i={...};
>>
>> @depends on !bad disable optional_qualifier@
>> identifier r1.i;
>> @@
>> +const
>> struct rate_control_ops i;
>>
>> File size before:
>>    text    data     bss     dec     hex filename
>>    1991     104       0    2095     82f wireless/realtek/rtlwifi/rc.o
>>
>> File size after:
>>    text    data     bss     dec     hex filename
>>    2095       0       0    2095     82f wireless/realtek/rtlwifi/rc.o
>>
[snip]
> The content of your patch is OK; however, your subject is not. By
> convention, "net: wireless: realtek:" is assumed. We do, however,
> include "rtlwifi:" to indicate which part of
> drivers/net/wireless/realtek/ is referenced.

In addition, the first part of the description is useful and the file
size information is reasonable too, but ~20 lines of Coccinelle script
in the commit message is rather pointless.

Jes
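The effect the patch relies on can be sketched outside the kernel. The snippet below is a standalone model with hypothetical names (`rc_ops_sketch`, `register_ops_sketch`); it mirrors the fact that `ieee80211_rate_control_register()` takes a pointer-to-const, so an ops table that is only ever passed to it can be declared `const` and placed in rodata.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct rate_control_ops. */
struct rc_ops_sketch {
	const char *name;
	int (*init)(void);
};

/* The registration API takes pointer-to-const, so callers may pass
 * a const-qualified (rodata-resident) table. */
static int register_ops_sketch(const struct rc_ops_sketch *ops)
{
	return ops && ops->init ? ops->init() : -1;
}

static int sketch_init(void)
{
	return 0;
}

/* 'const' moves the table from .data to .rodata, which is the
 * text/data shift visible in the size(1) output quoted above. */
static const struct rc_ops_sketch rtl_rc_ops_sketch = {
	.name = "rtl_rc_sketch",
	.init = sketch_init,
};
```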


Re: [PATCH] net: return value of skb_linearize should be handled in Linux kernel

2016-12-06 Thread Cong Wang
On Mon, Dec 5, 2016 at 11:10 PM, Zhouyi Zhou  wrote:
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.c 
> b/drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.c
> index 2a653ec..ab787cb 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.c
> @@ -490,7 +490,11 @@ int ixgbe_fcoe_ddp(struct ixgbe_adapter *adapter,
>  */
> if ((fh->fh_r_ctl == FC_RCTL_DD_SOL_DATA) &&
> (fctl & FC_FC_END_SEQ)) {
> -   skb_linearize(skb);
> +   int err = 0;
> +
> +   err = skb_linearize(skb);
> +   if (err)
> +   return err;


You can reuse 'rc' instead of adding 'err'.
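The shape of the fix being reviewed can be sketched in isolation. This is a toy model with made-up names (`linearize_sketch`, `process_frame`), not the ixgbe code: a helper that may fail with -ENOMEM must have its return value propagated rather than ignored.

```c
#include <assert.h>
#include <errno.h>

/* Stand-in for skb_linearize(): returns 0 on success, -ENOMEM on
 * allocation failure. */
static int linearize_sketch(int out_of_memory)
{
	return out_of_memory ? -ENOMEM : 0;
}

static int process_frame(int out_of_memory)
{
	int rc;

	/* Reuse the function's existing return-code variable, as the
	 * review suggests, rather than introducing a second 'err'. */
	rc = linearize_sketch(out_of_memory);
	if (rc)
		return rc;

	/* ... append the CRC EOF trailer only on success ... */
	return 0;
}
```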



> crc = (struct fcoe_crc_eof *)skb_put(skb, sizeof(*crc));
> crc->fcoe_eof = FC_EOF_T;
> }
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
> b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> index fee1f29..4926d48 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> @@ -2173,8 +2173,7 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector 
> *q_vector,
> total_rx_bytes += ddp_bytes;
> total_rx_packets += DIV_ROUND_UP(ddp_bytes,
>  mss);
> -   }
> -   if (!ddp_bytes) {
> +   } else {
> dev_kfree_skb_any(skb);
> continue;
> }


This piece doesn't seem to be related.


[PATCH net] phy: Don't increment MDIO bus refcount unless it's a different owner

2016-12-06 Thread Florian Fainelli
Commit 3e3aaf649416 ("phy: fix mdiobus module safety") fixed the way we
dealt with the MDIO bus module reference count, but introduced a
regression: if an Ethernet driver registers its own MDIO bus
driver, as is common, we end up with the Ethernet driver's
module->refcnt set to 1, thus preventing this driver from ever being
removed.

Fix this by comparing the network device's device driver owner against
the MDIO bus driver owner, and only if they are different, increment the
MDIO bus module refcount.

Fixes: 3e3aaf649416 ("phy: fix mdiobus module safety")
Signed-off-by: Florian Fainelli 
---
Russell,

I verified this against the ethoc driver primarily (on a TS7300 board)
and bcmgenet.

Thanks!

 drivers/net/phy/phy_device.c | 16 +---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index 1a4bf8acad78..c4ceb082e970 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -857,11 +857,17 @@ EXPORT_SYMBOL(phy_attached_print);
 int phy_attach_direct(struct net_device *dev, struct phy_device *phydev,
  u32 flags, phy_interface_t interface)
 {
+   struct module *ndev_owner = dev->dev.parent->driver->owner;
struct mii_bus *bus = phydev->mdio.bus;
	struct device *d = &phydev->mdio.dev;
int err;
 
-   if (!try_module_get(bus->owner)) {
+   /* For Ethernet device drivers that register their own MDIO bus, we
+* will have bus->owner match ndev_mod, so we do not want to increment
+* our own module->refcnt here, otherwise we would not be able to
+* unload later on.
+*/
+   if (ndev_owner != bus->owner && !try_module_get(bus->owner)) {
		dev_err(&bus->dev, "failed to get the bus module\n");
return -EIO;
}
@@ -921,7 +927,8 @@ int phy_attach_direct(struct net_device *dev, struct 
phy_device *phydev,
 
 error:
put_device(d);
-   module_put(bus->owner);
+   if (ndev_owner != bus->owner)
+   module_put(bus->owner);
return err;
 }
 EXPORT_SYMBOL(phy_attach_direct);
@@ -971,6 +978,8 @@ EXPORT_SYMBOL(phy_attach);
  */
 void phy_detach(struct phy_device *phydev)
 {
+   struct net_device *dev = phydev->attached_dev;
+   struct module *ndev_owner = dev->dev.parent->driver->owner;
struct mii_bus *bus;
int i;
 
@@ -998,7 +1007,8 @@ void phy_detach(struct phy_device *phydev)
bus = phydev->mdio.bus;
 
	put_device(&phydev->mdio.dev);
-   module_put(bus->owner);
+   if (ndev_owner != bus->owner)
+   module_put(bus->owner);
 }
 EXPORT_SYMBOL(phy_detach);
 
-- 
2.9.3



Re: [PATCH 09/10] vsock/virtio: fix src/dst cid format

2016-12-06 Thread Jason Wang



On 2016年12月06日 23:41, Michael S. Tsirkin wrote:

These fields are 64 bit; using le32_to_cpu and friends
on them will not do the right thing.
Fix this up.
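The bug class can be shown in a standalone sketch. The helper names below (`read_cid_wrong`, `read_cid_right`) are illustrative, not the kernel accessors: reading a 64-bit little-endian field through a 32-bit accessor truncates the value (and byte-swaps incorrectly on big-endian hosts); the accessor width must match the wire format.

```c
#include <assert.h>
#include <stdint.h>

/* Models le32_to_cpu(pkt->hdr.dst_cid) on a little-endian host:
 * only the low 32 bits of the 64-bit field survive. */
static uint32_t read_cid_wrong(uint64_t le64_field)
{
	return (uint32_t)le64_field;
}

/* Models le64_to_cpu(pkt->hdr.dst_cid): accessor width matches
 * the 64-bit field, so the full value is preserved. */
static uint64_t read_cid_right(uint64_t le64_field)
{
	return le64_field;
}
```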

Cc: sta...@vger.kernel.org
Signed-off-by: Michael S. Tsirkin 
---
  net/vmw_vsock/virtio_transport_common.c | 14 +++---
  1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/net/vmw_vsock/virtio_transport_common.c 
b/net/vmw_vsock/virtio_transport_common.c
index 6120384..22e99c4 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -606,9 +606,9 @@ static int virtio_transport_reset_no_sock(struct 
virtio_vsock_pkt *pkt)
return 0;
  
	pkt = virtio_transport_alloc_pkt(&info, 0,

-le32_to_cpu(pkt->hdr.dst_cid),
+le64_to_cpu(pkt->hdr.dst_cid),
 le32_to_cpu(pkt->hdr.dst_port),
-le32_to_cpu(pkt->hdr.src_cid),
+le64_to_cpu(pkt->hdr.src_cid),
 le32_to_cpu(pkt->hdr.src_port));


Looking at sockaddr_vm, svm_cid is "unsigned int"; do we really want 64 
bits here?



if (!pkt)
return -ENOMEM;
@@ -823,7 +823,7 @@ virtio_transport_send_response(struct vsock_sock *vsk,
struct virtio_vsock_pkt_info info = {
.op = VIRTIO_VSOCK_OP_RESPONSE,
.type = VIRTIO_VSOCK_TYPE_STREAM,
-   .remote_cid = le32_to_cpu(pkt->hdr.src_cid),
+   .remote_cid = le64_to_cpu(pkt->hdr.src_cid),
.remote_port = le32_to_cpu(pkt->hdr.src_port),
.reply = true,
};
@@ -863,9 +863,9 @@ virtio_transport_recv_listen(struct sock *sk, struct 
virtio_vsock_pkt *pkt)
child->sk_state = SS_CONNECTED;
  
  	vchild = vsock_sk(child);

-   vsock_addr_init(&vchild->local_addr, le32_to_cpu(pkt->hdr.dst_cid),
+   vsock_addr_init(&vchild->local_addr, le64_to_cpu(pkt->hdr.dst_cid),
	le32_to_cpu(pkt->hdr.dst_port));
-   vsock_addr_init(&vchild->remote_addr, le32_to_cpu(pkt->hdr.src_cid),
+   vsock_addr_init(&vchild->remote_addr, le64_to_cpu(pkt->hdr.src_cid),
le32_to_cpu(pkt->hdr.src_port));
  
  	vsock_insert_connected(vchild);

@@ -904,9 +904,9 @@ void virtio_transport_recv_pkt(struct virtio_vsock_pkt *pkt)
struct sock *sk;
bool space_available;
  
-	vsock_addr_init(&src, le32_to_cpu(pkt->hdr.src_cid),

+   vsock_addr_init(&src, le64_to_cpu(pkt->hdr.src_cid),
	le32_to_cpu(pkt->hdr.src_port));
-   vsock_addr_init(&dst, le32_to_cpu(pkt->hdr.dst_cid),
+   vsock_addr_init(&dst, le64_to_cpu(pkt->hdr.dst_cid),
	le32_to_cpu(pkt->hdr.dst_port));
  
  	trace_virtio_transport_recv_pkt(src.svm_cid, src.svm_port,




[PATCH net-next 1/1] driver: macvlan: Remove the rcu member of macvlan_port

2016-12-06 Thread fgao
From: Gao Feng 

When freeing the macvlan_port in macvlan_port_destroy, it is safe to
free it directly, because netdev_rx_handler_unregister already enforces
a grace period.
So it is unnecessary to use kfree_rcu for macvlan_port.

Signed-off-by: Gao Feng 
---
 drivers/net/macvlan.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index 3c0a171..20b3fdf2 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -43,7 +43,6 @@ struct macvlan_port {
struct net_device   *dev;
struct hlist_head   vlan_hash[MACVLAN_HASH_SIZE];
struct list_headvlans;
-   struct rcu_head rcu;
struct sk_buff_head bc_queue;
struct work_struct  bc_work;
boolpassthru;
@@ -1151,7 +1150,7 @@ static void macvlan_port_destroy(struct net_device *dev)
	cancel_work_sync(&port->bc_work);
	__skb_queue_purge(&port->bc_queue);
 
-   kfree_rcu(port, rcu);
+   kfree(port);
 }
 
 static int macvlan_validate(struct nlattr *tb[], struct nlattr *data[])
-- 
1.9.1




Re: [PATCH 08/10] vsock/virtio: mark an internal function static

2016-12-06 Thread Jason Wang



On 2016年12月06日 23:41, Michael S. Tsirkin wrote:

virtio_transport_alloc_pkt is only used locally, make it static.

Signed-off-by: Michael S. Tsirkin 
---
  net/vmw_vsock/virtio_transport_common.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/vmw_vsock/virtio_transport_common.c 
b/net/vmw_vsock/virtio_transport_common.c
index a53b3a1..6120384 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -32,7 +32,7 @@ static const struct virtio_transport 
*virtio_transport_get_ops(void)
return container_of(t, struct virtio_transport, transport);
  }
  
-struct virtio_vsock_pkt *

+static struct virtio_vsock_pkt *
  virtio_transport_alloc_pkt(struct virtio_vsock_pkt_info *info,
   size_t len,
   u32 src_cid,


Git grep shows it was used by tracing.


Re: [PATCH 07/10] vsock/virtio: add a missing __le annotation

2016-12-06 Thread Jason Wang



On 2016年12月06日 23:40, Michael S. Tsirkin wrote:

The guest cid is read from config space, therefore it is in little-endian
format and is treated as such; annotate it accordingly.

Signed-off-by: Michael S. Tsirkin 
---
  net/vmw_vsock/virtio_transport.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index 936d7ee..90096b9 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -336,7 +336,7 @@ static void virtio_vsock_reset_sock(struct sock *sk)
  static void virtio_vsock_update_guest_cid(struct virtio_vsock *vsock)
  {
struct virtio_device *vdev = vsock->vdev;
-   u64 guest_cid;
+   __le64 guest_cid;
  
  	vdev->config->get(vdev, offsetof(struct virtio_vsock_config, guest_cid),

  &guest_cid, sizeof(guest_cid));


Reviewed-by: Jason Wang 


Re: [PATCH 06/10] vhost: add missing __user annotations

2016-12-06 Thread Jason Wang



On 2016年12月06日 23:40, Michael S. Tsirkin wrote:

Several vhost functions were missing __user annotations
on pointers, causing sparse warnings. Fix this up.

Signed-off-by: Michael S. Tsirkin 
---
  drivers/vhost/vhost.c | 10 +-
  1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 7331ef3..ba7db68 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -719,7 +719,7 @@ static int memory_access_ok(struct vhost_dev *d, struct 
vhost_umem *umem,
  static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
  struct iovec iov[], int iov_size, int access);
  
-static int vhost_copy_to_user(struct vhost_virtqueue *vq, void *to,

+static int vhost_copy_to_user(struct vhost_virtqueue *vq, void __user *to,
  const void *from, unsigned size)
  {
int ret;
@@ -749,7 +749,7 @@ static int vhost_copy_to_user(struct vhost_virtqueue *vq, 
void *to,
  }
  
  static int vhost_copy_from_user(struct vhost_virtqueue *vq, void *to,

-   void *from, unsigned size)
+   void __user *from, unsigned size)
  {
int ret;
  
@@ -783,7 +783,7 @@ static int vhost_copy_from_user(struct vhost_virtqueue *vq, void *to,

  }
  
  static void __user *__vhost_get_user(struct vhost_virtqueue *vq,

-void *addr, unsigned size)
+void __user *addr, unsigned size)
  {
int ret;
  
@@ -934,8 +934,8 @@ static int umem_access_ok(u64 uaddr, u64 size, int access)

return 0;
  }
  
-int vhost_process_iotlb_msg(struct vhost_dev *dev,

-   struct vhost_iotlb_msg *msg)
+static int vhost_process_iotlb_msg(struct vhost_dev *dev,
+  struct vhost_iotlb_msg *msg)
  {
int ret = 0;
  


The patch looks good, but this last hunk looks like another static 
conversion, not a __user annotation.


Re: [PATCH 05/10] vhost: make interval tree static inline

2016-12-06 Thread Jason Wang



On 2016年12月06日 23:40, Michael S. Tsirkin wrote:

vhost_umem_interval_tree is only used locally within vhost.c; mark it
static. As some of the generated functions go unused, this triggers
warnings unless we also mark them inline.

Signed-off-by: Michael S. Tsirkin 
---
  drivers/vhost/vhost.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index c6f2d89..7331ef3 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -49,7 +49,7 @@ enum {
  
  INTERVAL_TREE_DEFINE(struct vhost_umem_node,

 rb, __u64, __subtree_last,
-START, LAST, , vhost_umem_interval_tree);
+START, LAST, static inline, vhost_umem_interval_tree);
  
  #ifdef CONFIG_VHOST_CROSS_ENDIAN_LEGACY

  static void vhost_disable_cross_endian(struct vhost_virtqueue *vq)


Reviewed-by: Jason Wang 


[net-next][PATCH v2 13/18] RDS: RDMA: Fix the composite message user notification

2016-12-06 Thread Santosh Shilimkar
When an application sends an RDS RDMA composite message consisting of an
RDMA transfer followed by a non-RDMA payload, it expects to
be notified *only* when the full message gets delivered. RDS RDMA
notification doesn't behave this way, though.

Thanks to Venkat for debugging and root-causing the issue,
where only the first part of the message (RDMA) was
successfully delivered but delivery of the remainder payload failed.
In that case, the application should not be notified with
a false positive of message delivery success.

Fix this case by making sure the user gets notified only after
the full message delivery.
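The notification decision after the fix can be modeled in a few lines. This is a standalone sketch with flattened field names (`rdma_op_active`, `data_op_notify`), not the kernel structures: the notifier fires if either the rdma op or the data op requested notification, so for a composite message the flag moved onto the data op makes notification track the last operation.

```c
#include <assert.h>
#include <stdbool.h>

/* Flattened toy model of the fields consulted by
 * rds_rdma_send_complete() after the fix. */
struct msg_sketch {
	bool rdma_op_active;
	bool rdma_op_notify;
	bool data_op_notify;
};

static bool should_notify(const struct msg_sketch *rm)
{
	/* Mirrors: notify = rm->rdma.op_notify | rm->data.op_notify; */
	bool notify = rm->rdma_op_notify | rm->data_op_notify;

	return rm->rdma_op_active && notify;
}
```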

Reviewed-by: Venkat Venkatsubra 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib_send.c | 25 +++--
 net/rds/rdma.c| 10 ++
 net/rds/rds.h |  1 +
 net/rds/send.c|  4 +++-
 4 files changed, 29 insertions(+), 11 deletions(-)

diff --git a/net/rds/ib_send.c b/net/rds/ib_send.c
index 19eca5c..5e72de1 100644
--- a/net/rds/ib_send.c
+++ b/net/rds/ib_send.c
@@ -69,16 +69,6 @@ static void rds_ib_send_complete(struct rds_message *rm,
complete(rm, notify_status);
 }
 
-static void rds_ib_send_unmap_data(struct rds_ib_connection *ic,
-  struct rm_data_op *op,
-  int wc_status)
-{
-   if (op->op_nents)
-   ib_dma_unmap_sg(ic->i_cm_id->device,
-   op->op_sg, op->op_nents,
-   DMA_TO_DEVICE);
-}
-
 static void rds_ib_send_unmap_rdma(struct rds_ib_connection *ic,
   struct rm_rdma_op *op,
   int wc_status)
@@ -139,6 +129,21 @@ static void rds_ib_send_unmap_atomic(struct 
rds_ib_connection *ic,
rds_ib_stats_inc(s_ib_atomic_fadd);
 }
 
+static void rds_ib_send_unmap_data(struct rds_ib_connection *ic,
+  struct rm_data_op *op,
+  int wc_status)
+{
+   struct rds_message *rm = container_of(op, struct rds_message, data);
+
+   if (op->op_nents)
+   ib_dma_unmap_sg(ic->i_cm_id->device,
+   op->op_sg, op->op_nents,
+   DMA_TO_DEVICE);
+
+   if (rm->rdma.op_active && rm->data.op_notify)
+   rds_ib_send_unmap_rdma(ic, &rm->rdma, wc_status);
+}
+
 /*
  * Unmap the resources associated with a struct send_work.
  *
diff --git a/net/rds/rdma.c b/net/rds/rdma.c
index 8151c49..dd508e0 100644
--- a/net/rds/rdma.c
+++ b/net/rds/rdma.c
@@ -627,6 +627,16 @@ int rds_cmsg_rdma_args(struct rds_sock *rs, struct 
rds_message *rm,
}
op->op_notifier->n_user_token = args->user_token;
op->op_notifier->n_status = RDS_RDMA_SUCCESS;
+
+   /* Enable rmda notification on data operation for composite
+* rds messages and make sure notification is enabled only
+* for the data operation which follows it so that application
+* gets notified only after full message gets delivered.
+*/
+   if (rm->data.op_sg) {
+   rm->rdma.op_notify = 0;
+   rm->data.op_notify = !!(args->flags & 
RDS_RDMA_NOTIFY_ME);
+   }
}
 
/* The cookie contains the R_Key of the remote memory region, and
diff --git a/net/rds/rds.h b/net/rds/rds.h
index ebbf909..0bb8213 100644
--- a/net/rds/rds.h
+++ b/net/rds/rds.h
@@ -419,6 +419,7 @@ struct rds_message {
} rdma;
struct rm_data_op {
unsigned intop_active:1;
+   unsigned intop_notify:1;
unsigned intop_nents;
unsigned intop_count;
unsigned intop_dmasg;
diff --git a/net/rds/send.c b/net/rds/send.c
index 0a6f38b..45e025b 100644
--- a/net/rds/send.c
+++ b/net/rds/send.c
@@ -476,12 +476,14 @@ void rds_rdma_send_complete(struct rds_message *rm, int 
status)
struct rm_rdma_op *ro;
struct rds_notifier *notifier;
unsigned long flags;
+   unsigned int notify = 0;
 
	spin_lock_irqsave(&rm->m_rs_lock, flags);
 
+   notify =  rm->rdma.op_notify | rm->data.op_notify;
	ro = &rm->rdma;
if (test_bit(RDS_MSG_ON_SOCK, >m_flags) &&
-   ro->op_active && ro->op_notify && ro->op_notifier) {
+   ro->op_active && notify && ro->op_notifier) {
notifier = ro->op_notifier;
rs = rm->m_rs;
sock_hold(rds_rs_to_sk(rs));
-- 
1.9.1



[net-next][PATCH v2 04/18] RDS: IB: make the transport retry count smallest

2016-12-06 Thread Santosh Shilimkar
Transport retry is not very useful, since it indicates packet loss
in the fabric; it is better to fail over fast rather than retry for longer.

Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/rds/ib.h b/net/rds/ib.h
index 45ac8e8..f4e8121 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -16,7 +16,7 @@
 #define RDS_IB_DEFAULT_SEND_WR 256
 #define RDS_IB_DEFAULT_FR_WR   512
 
-#define RDS_IB_DEFAULT_RETRY_COUNT 2
+#define RDS_IB_DEFAULT_RETRY_COUNT 1
 
 #define RDS_IB_SUPPORTED_PROTOCOLS 0x0003  /* minor versions 
supported */
 
-- 
1.9.1



[net-next][PATCH v2 10/18] RDS: IB: track and log active side endpoint in connection

2016-12-06 Thread Santosh Shilimkar
It is useful to know the active and passive end points of an
RDS IB connection.

Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib.h|  3 +++
 net/rds/ib_cm.c | 11 +++
 2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/net/rds/ib.h b/net/rds/ib.h
index f14c26d..97e7696 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -181,6 +181,9 @@ struct rds_ib_connection {
 
/* Batched completions */
unsigned inti_unsignaled_wrs;
+
+   /* Endpoint role in connection */
+   int i_active_side;
 };
 
 /* This assumes that atomic_t is at least 32 bits */
diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index 3002acf..4d1bf04 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -120,16 +120,17 @@ void rds_ib_cm_connect_complete(struct rds_connection 
*conn, struct rdma_cm_even
rds_conn_destroy(conn);
return;
} else {
-   pr_notice("RDS/IB: connected <%pI4,%pI4> version %u.%u%s\n",
+   pr_notice("RDS/IB: %s conn connected <%pI4,%pI4> version 
%u.%u%s\n",
+ ic->i_active_side ? "Active" : "Passive",
  &conn->c_laddr, &conn->c_faddr,
  RDS_PROTOCOL_MAJOR(conn->c_version),
  RDS_PROTOCOL_MINOR(conn->c_version),
  ic->i_flowctl ? ", flow control" : "");
}
 
-   /*
-* Init rings and fill recv. this needs to wait until protocol 
negotiation
-* is complete, since ring layout is different from 3.0 to 3.1.
+   /* Init rings and fill recv. this needs to wait until protocol
+* negotiation is complete, since ring layout is different
+* from 3.1 to 4.1.
 */
rds_ib_send_init_ring(ic);
rds_ib_recv_init_ring(ic);
@@ -685,6 +686,7 @@ int rds_ib_cm_initiate_connect(struct rdma_cm_id *cm_id)
if (ic->i_cm_id == cm_id)
ret = 0;
}
+   ic->i_active_side = true;
return ret;
 }
 
@@ -859,6 +861,7 @@ void rds_ib_conn_path_shutdown(struct rds_conn_path *cp)
ic->i_sends = NULL;
vfree(ic->i_recvs);
ic->i_recvs = NULL;
+   ic->i_active_side = false;
 }
 
 int rds_ib_conn_alloc(struct rds_connection *conn, gfp_t gfp)
-- 
1.9.1



[net-next][PATCH v2 08/18] RDS: IB: split the mr registration and invalidation path

2016-12-06 Thread Santosh Shilimkar
MR invalidation in RDS is done in a background thread, not in the
data path like registration. So break the dependency between them,
which helps remove the performance bottleneck.

Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib.h  |  4 +++-
 net/rds/ib_cm.c   |  9 +++--
 net/rds/ib_frmr.c | 11 ++-
 3 files changed, 16 insertions(+), 8 deletions(-)

diff --git a/net/rds/ib.h b/net/rds/ib.h
index f4e8121..f14c26d 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -14,7 +14,8 @@
 
 #define RDS_IB_DEFAULT_RECV_WR 1024
 #define RDS_IB_DEFAULT_SEND_WR 256
-#define RDS_IB_DEFAULT_FR_WR   512
+#define RDS_IB_DEFAULT_FR_WR   256
+#define RDS_IB_DEFAULT_FR_INV_WR   256
 
 #define RDS_IB_DEFAULT_RETRY_COUNT 1
 
@@ -125,6 +126,7 @@ struct rds_ib_connection {
 
/* To control the number of wrs from fastreg */
atomic_ti_fastreg_wrs;
+   atomic_ti_fastunreg_wrs;
 
/* interrupt handling */
struct tasklet_struct   i_send_tasklet;
diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index b9da1e5..3002acf 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -382,7 +382,10 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
 * completion queue and send queue. This extra space is used for FRMR
 * registration and invalidation work requests
 */
-   fr_queue_space = (rds_ibdev->use_fastreg ? RDS_IB_DEFAULT_FR_WR : 0);
+   fr_queue_space = rds_ibdev->use_fastreg ?
+(RDS_IB_DEFAULT_FR_WR + 1) +
+(RDS_IB_DEFAULT_FR_INV_WR + 1)
+: 0;
 
/* add the conn now so that connection establishment has the dev */
rds_ib_add_conn(rds_ibdev, conn);
@@ -444,6 +447,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
attr.send_cq = ic->i_send_cq;
attr.recv_cq = ic->i_recv_cq;
atomic_set(&ic->i_fastreg_wrs, RDS_IB_DEFAULT_FR_WR);
+   atomic_set(&ic->i_fastunreg_wrs, RDS_IB_DEFAULT_FR_INV_WR);
 
/*
 * XXX this can fail if max_*_wr is too large?  Are we supposed
@@ -766,7 +770,8 @@ void rds_ib_conn_path_shutdown(struct rds_conn_path *cp)
wait_event(rds_ib_ring_empty_wait,
   rds_ib_ring_empty(&ic->i_recv_ring) &&
   (atomic_read(&ic->i_signaled_sends) == 0) &&
-  (atomic_read(&ic->i_fastreg_wrs) == RDS_IB_DEFAULT_FR_WR));
+  (atomic_read(&ic->i_fastreg_wrs) == RDS_IB_DEFAULT_FR_WR) &&
+  (atomic_read(&ic->i_fastunreg_wrs) == RDS_IB_DEFAULT_FR_INV_WR));
tasklet_kill(&ic->i_send_tasklet);
tasklet_kill(&ic->i_recv_tasklet);
 
diff --git a/net/rds/ib_frmr.c b/net/rds/ib_frmr.c
index 66b3d62..48332a6 100644
--- a/net/rds/ib_frmr.c
+++ b/net/rds/ib_frmr.c
@@ -241,8 +241,8 @@ static int rds_ib_post_inv(struct rds_ib_mr *ibmr)
if (frmr->fr_state != FRMR_IS_INUSE)
goto out;
 
-   while (atomic_dec_return(&ibmr->ic->i_fastreg_wrs) <= 0) {
-   atomic_inc(&ibmr->ic->i_fastreg_wrs);
+   while (atomic_dec_return(&ibmr->ic->i_fastunreg_wrs) <= 0) {
+   atomic_inc(&ibmr->ic->i_fastunreg_wrs);
cpu_relax();
}
 
@@ -261,7 +261,7 @@ static int rds_ib_post_inv(struct rds_ib_mr *ibmr)
if (unlikely(ret)) {
frmr->fr_state = FRMR_IS_STALE;
frmr->fr_inv = false;
-   atomic_inc(&ibmr->ic->i_fastreg_wrs);
+   atomic_inc(&ibmr->ic->i_fastunreg_wrs);
pr_err("RDS/IB: %s returned error(%d)\n", __func__, ret);
goto out;
}
@@ -289,9 +289,10 @@ void rds_ib_mr_cqe_handler(struct rds_ib_connection *ic, 
struct ib_wc *wc)
if (frmr->fr_inv) {
frmr->fr_state = FRMR_IS_FREE;
frmr->fr_inv = false;
+   atomic_inc(&ic->i_fastreg_wrs);
+   } else {
+   atomic_inc(&ic->i_fastunreg_wrs);
}
-
-   atomic_inc(&ic->i_fastreg_wrs);
 }
 
 void rds_ib_unreg_frmr(struct list_head *list, unsigned int *nfreed,
-- 
1.9.1



[net-next][PATCH v2 12/18] RDS: IB: Add vector spreading for cqs

2016-12-06 Thread Santosh Shilimkar
Based on available device vectors, allocate cqs accordingly to
get a better spread of completion vectors, which helps performance
a great deal.

Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib.c| 11 +++
 net/rds/ib.h|  5 +
 net/rds/ib_cm.c | 40 +---
 3 files changed, 53 insertions(+), 3 deletions(-)

diff --git a/net/rds/ib.c b/net/rds/ib.c
index 5680d90..8d70884 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -111,6 +111,9 @@ static void rds_ib_dev_free(struct work_struct *work)
kfree(i_ipaddr);
}
 
+   if (rds_ibdev->vector_load)
+   kfree(rds_ibdev->vector_load);
+
kfree(rds_ibdev);
 }
 
@@ -159,6 +162,14 @@ static void rds_ib_add_one(struct ib_device *device)
rds_ibdev->max_initiator_depth = device->attrs.max_qp_init_rd_atom;
rds_ibdev->max_responder_resources = device->attrs.max_qp_rd_atom;
 
+   rds_ibdev->vector_load = kzalloc(sizeof(int) * device->num_comp_vectors,
+GFP_KERNEL);
+   if (!rds_ibdev->vector_load) {
+   pr_err("RDS/IB: %s failed to allocate vector memory\n",
+   __func__);
+   goto put_dev;
+   }
+
rds_ibdev->dev = device;
rds_ibdev->pd = ib_alloc_pd(device, 0);
if (IS_ERR(rds_ibdev->pd)) {
diff --git a/net/rds/ib.h b/net/rds/ib.h
index 4987387..4b133b8 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -185,6 +185,10 @@ struct rds_ib_connection {
 
/* Endpoint role in connection */
int i_active_side;
+
+   /* Send/Recv vectors */
+   int i_scq_vector;
+   int i_rcq_vector;
 };
 
 /* This assumes that atomic_t is at least 32 bits */
@@ -227,6 +231,7 @@ struct rds_ib_device {
spinlock_t  spinlock;   /* protect the above */
atomic_trefcount;
struct work_struct  free_work;
+   int *vector_load;
 };
 
 #define ibdev_to_node(ibdev) dev_to_node(ibdev->dma_device)
diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index 4d1bf04..33c8584 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -358,6 +358,28 @@ static void rds_ib_cq_comp_handler_send(struct ib_cq *cq, 
void *context)
tasklet_schedule(&ic->i_send_tasklet);
 }
 
+static inline int ibdev_get_unused_vector(struct rds_ib_device *rds_ibdev)
+{
+   int min = rds_ibdev->vector_load[rds_ibdev->dev->num_comp_vectors - 1];
+   int index = rds_ibdev->dev->num_comp_vectors - 1;
+   int i;
+
+   for (i = rds_ibdev->dev->num_comp_vectors - 1; i >= 0; i--) {
+   if (rds_ibdev->vector_load[i] < min) {
+   index = i;
+   min = rds_ibdev->vector_load[i];
+   }
+   }
+
+   rds_ibdev->vector_load[index]++;
+   return index;
+}
+
+static inline void ibdev_put_vector(struct rds_ib_device *rds_ibdev, int index)
+{
+   rds_ibdev->vector_load[index]--;
+}
+
 /*
  * This needs to be very careful to not leave IS_ERR pointers around for
  * cleanup to trip over.
@@ -399,25 +421,30 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
/* Protection domain and memory range */
ic->i_pd = rds_ibdev->pd;
 
+   ic->i_scq_vector = ibdev_get_unused_vector(rds_ibdev);
cq_attr.cqe = ic->i_send_ring.w_nr + fr_queue_space + 1;
-
+   cq_attr.comp_vector = ic->i_scq_vector;
ic->i_send_cq = ib_create_cq(dev, rds_ib_cq_comp_handler_send,
 rds_ib_cq_event_handler, conn,
 &cq_attr);
if (IS_ERR(ic->i_send_cq)) {
ret = PTR_ERR(ic->i_send_cq);
ic->i_send_cq = NULL;
+   ibdev_put_vector(rds_ibdev, ic->i_scq_vector);
rdsdebug("ib_create_cq send failed: %d\n", ret);
goto out;
}
 
+   ic->i_rcq_vector = ibdev_get_unused_vector(rds_ibdev);
cq_attr.cqe = ic->i_recv_ring.w_nr;
+   cq_attr.comp_vector = ic->i_rcq_vector;
ic->i_recv_cq = ib_create_cq(dev, rds_ib_cq_comp_handler_recv,
 rds_ib_cq_event_handler, conn,
 &cq_attr);
if (IS_ERR(ic->i_recv_cq)) {
ret = PTR_ERR(ic->i_recv_cq);
ic->i_recv_cq = NULL;
+   ibdev_put_vector(rds_ibdev, ic->i_rcq_vector);
rdsdebug("ib_create_cq recv failed: %d\n", ret);
goto out;
}
@@ -780,10 +807,17 @@ void rds_ib_conn_path_shutdown(struct rds_conn_path *cp)
/* first destroy the ib state that generates callbacks */
if (ic->i_cm_id->qp)
rdma_destroy_qp(ic->i_cm_id);
-   if (ic->i_send_cq)
+   if (ic->i_send_cq) {
+   if (ic->rds_ibdev)
+

[net-next][PATCH v2 16/18] RDS: make message size limit compliant with spec

2016-12-06 Thread Santosh Shilimkar
From: Avinash Repaka 

RDS supports a max message size of 1M, but the code doesn't check
this in all cases. Patch fixes it for RDMA & non-RDMA and for RDS MR
size, and it's enforced irrespective of the underlying transport.

Signed-off-by: Avinash Repaka 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/rdma.c |  9 -
 net/rds/rds.h  |  3 +++
 net/rds/send.c | 31 +++
 3 files changed, 42 insertions(+), 1 deletion(-)

diff --git a/net/rds/rdma.c b/net/rds/rdma.c
index dd508e0..6e8db33 100644
--- a/net/rds/rdma.c
+++ b/net/rds/rdma.c
@@ -40,7 +40,6 @@
 /*
  * XXX
  *  - build with sparse
- *  - should we limit the size of a mr region?  let transport return failure?
  *  - should we detect duplicate keys on a socket?  hmm.
  *  - an rdma is an mlock, apply rlimit?
  */
@@ -200,6 +199,14 @@ static int __rds_rdma_map(struct rds_sock *rs, struct 
rds_get_mr_args *args,
goto out;
}
 
+   /* Restrict the size of mr irrespective of underlying transport
+* To account for unaligned mr regions, subtract one from nr_pages
+*/
+   if ((nr_pages - 1) > (RDS_MAX_MSG_SIZE >> PAGE_SHIFT)) {
+   ret = -EMSGSIZE;
+   goto out;
+   }
+
rdsdebug("RDS: get_mr addr %llx len %llu nr_pages %u\n",
args->vec.addr, args->vec.bytes, nr_pages);
 
diff --git a/net/rds/rds.h b/net/rds/rds.h
index 8ccd5a9..f713194 100644
--- a/net/rds/rds.h
+++ b/net/rds/rds.h
@@ -50,6 +50,9 @@ void rdsdebug(char *fmt, ...)
 #define RDS_FRAG_SHIFT 12
 #define RDS_FRAG_SIZE  ((unsigned int)(1 << RDS_FRAG_SHIFT))
 
+/* Used to limit both RDMA and non-RDMA RDS message to 1MB */
+#define RDS_MAX_MSG_SIZE   ((unsigned int)(1 << 20))
+
 #define RDS_CONG_MAP_BYTES (65536 / 8)
 #define RDS_CONG_MAP_PAGES (PAGE_ALIGN(RDS_CONG_MAP_BYTES) / PAGE_SIZE)
 #define RDS_CONG_MAP_PAGE_BITS (PAGE_SIZE * 8)
diff --git a/net/rds/send.c b/net/rds/send.c
index 45e025b..5cc6403 100644
--- a/net/rds/send.c
+++ b/net/rds/send.c
@@ -994,6 +994,26 @@ static int rds_send_mprds_hash(struct rds_sock *rs, struct 
rds_connection *conn)
return hash;
 }
 
+static int rds_rdma_bytes(struct msghdr *msg, size_t *rdma_bytes)
+{
+   struct rds_rdma_args *args;
+   struct cmsghdr *cmsg;
+
+   for_each_cmsghdr(cmsg, msg) {
+   if (!CMSG_OK(msg, cmsg))
+   return -EINVAL;
+
+   if (cmsg->cmsg_level != SOL_RDS)
+   continue;
+
+   if (cmsg->cmsg_type == RDS_CMSG_RDMA_ARGS) {
+   args = CMSG_DATA(cmsg);
+   *rdma_bytes += args->remote_vec.bytes;
+   }
+   }
+   return 0;
+}
+
 int rds_sendmsg(struct socket *sock, struct msghdr *msg, size_t payload_len)
 {
struct sock *sk = sock->sk;
@@ -1008,6 +1028,7 @@ int rds_sendmsg(struct socket *sock, struct msghdr *msg, 
size_t payload_len)
int nonblock = msg->msg_flags & MSG_DONTWAIT;
long timeo = sock_sndtimeo(sk, nonblock);
struct rds_conn_path *cpath;
+   size_t total_payload_len = payload_len, rdma_payload_len = 0;
 
/* Mirror Linux UDP mirror of BSD error message compatibility */
/* XXX: Perhaps MSG_MORE someday */
@@ -1040,6 +1061,16 @@ int rds_sendmsg(struct socket *sock, struct msghdr *msg, 
size_t payload_len)
}
release_sock(sk);
 
+   ret = rds_rdma_bytes(msg, &rdma_payload_len);
+   if (ret)
+   goto out;
+
+   total_payload_len += rdma_payload_len;
+   if (max_t(size_t, payload_len, rdma_payload_len) > RDS_MAX_MSG_SIZE) {
+   ret = -EMSGSIZE;
+   goto out;
+   }
+
if (payload_len > rds_sk_sndbuf(rs)) {
ret = -EMSGSIZE;
goto out;
-- 
1.9.1



[net-next][PATCH v2 14/18] RDS: IB: fix panic due to handlers running post teardown

2016-12-06 Thread Santosh Shilimkar
The shutdown code's reaping loop takes care of emptying the
CQs before they are destroyed. And once the tasklets are
killed, the handlers are not expected to run.

But because of core tasklet code issues, a tasklet handler could
still run even after tasklet_kill. The RDS IB shutdown code already
reaps the CQs before freeing cq/qp resources, so the handlers have
nothing left to do post shutdown.

On the other hand, any handler running after teardown and trying
to access already freed qp/cq resources causes issues.
Patch fixes this race by making sure that the handlers return
without any action post teardown.

Reviewed-by: Wengang 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib.h|  1 +
 net/rds/ib_cm.c | 12 
 2 files changed, 13 insertions(+)

diff --git a/net/rds/ib.h b/net/rds/ib.h
index 4b133b8..8efd1eb 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -185,6 +185,7 @@ struct rds_ib_connection {
 
/* Endpoint role in connection */
int i_active_side;
+   atomic_ti_cq_quiesce;
 
/* Send/Recv vectors */
int i_scq_vector;
diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index 33c8584..ce3775a 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -128,6 +128,8 @@ void rds_ib_cm_connect_complete(struct rds_connection 
*conn, struct rdma_cm_even
  ic->i_flowctl ? ", flow control" : "");
}
 
+   atomic_set(&ic->i_cq_quiesce, 0);
+
/* Init rings and fill recv. this needs to wait until protocol
 * negotiation is complete, since ring layout is different
 * from 3.1 to 4.1.
@@ -267,6 +269,10 @@ static void rds_ib_tasklet_fn_send(unsigned long data)
 
rds_ib_stats_inc(s_ib_tasklet_call);
 
+   /* if cq has been already reaped, ignore incoming cq event */
+   if (atomic_read(&ic->i_cq_quiesce))
+   return;
+
poll_scq(ic, ic->i_send_cq, ic->i_send_wc);
ib_req_notify_cq(ic->i_send_cq, IB_CQ_NEXT_COMP);
poll_scq(ic, ic->i_send_cq, ic->i_send_wc);
@@ -308,6 +314,10 @@ static void rds_ib_tasklet_fn_recv(unsigned long data)
 
rds_ib_stats_inc(s_ib_tasklet_call);
 
+   /* if cq has been already reaped, ignore incoming cq event */
+   if (atomic_read(&ic->i_cq_quiesce))
+   return;
+
memset(&state, 0, sizeof(state));
poll_rcq(ic, ic->i_recv_cq, ic->i_recv_wc, &state);
ib_req_notify_cq(ic->i_recv_cq, IB_CQ_SOLICITED);
@@ -804,6 +814,8 @@ void rds_ib_conn_path_shutdown(struct rds_conn_path *cp)
tasklet_kill(&ic->i_send_tasklet);
tasklet_kill(&ic->i_recv_tasklet);
 
+   atomic_set(&ic->i_cq_quiesce, 1);
+
/* first destroy the ib state that generates callbacks */
if (ic->i_cm_id->qp)
rdma_destroy_qp(ic->i_cm_id);
-- 
1.9.1



[net-next][PATCH v2 11/18] RDS: IB: add a few useful cache stats

2016-12-06 Thread Santosh Shilimkar
Tracks the ib receive cache total, incoming and frag allocations.

Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib.h   | 7 +++
 net/rds/ib_recv.c  | 6 ++
 net/rds/ib_stats.c | 2 ++
 3 files changed, 15 insertions(+)

diff --git a/net/rds/ib.h b/net/rds/ib.h
index 97e7696..4987387 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -151,6 +151,7 @@ struct rds_ib_connection {
u64 i_ack_recv; /* last ACK received */
struct rds_ib_refill_cache i_cache_incs;
struct rds_ib_refill_cache i_cache_frags;
+   atomic_ti_cache_allocs;
 
/* sending acks */
unsigned long   i_ack_flags;
@@ -254,6 +255,8 @@ struct rds_ib_statistics {
uint64_ts_ib_rx_refill_from_cq;
uint64_ts_ib_rx_refill_from_thread;
uint64_ts_ib_rx_alloc_limit;
+   uint64_ts_ib_rx_total_frags;
+   uint64_ts_ib_rx_total_incs;
uint64_ts_ib_rx_credit_updates;
uint64_ts_ib_ack_sent;
uint64_ts_ib_ack_send_failure;
@@ -276,6 +279,8 @@ struct rds_ib_statistics {
uint64_ts_ib_rdma_mr_1m_reused;
uint64_ts_ib_atomic_cswp;
uint64_ts_ib_atomic_fadd;
+   uint64_ts_ib_recv_added_to_cache;
+   uint64_ts_ib_recv_removed_from_cache;
 };
 
 extern struct workqueue_struct *rds_ib_wq;
@@ -406,6 +411,8 @@ int rds_ib_send_grab_credits(struct rds_ib_connection *ic, 
u32 wanted,
 /* ib_stats.c */
 DECLARE_PER_CPU(struct rds_ib_statistics, rds_ib_stats);
 #define rds_ib_stats_inc(member) rds_stats_inc_which(rds_ib_stats, member)
+#define rds_ib_stats_add(member, count) \
+   rds_stats_add_which(rds_ib_stats, member, count)
 unsigned int rds_ib_stats_info_copy(struct rds_info_iterator *iter,
unsigned int avail);
 
diff --git a/net/rds/ib_recv.c b/net/rds/ib_recv.c
index 6803b75..4b0f126 100644
--- a/net/rds/ib_recv.c
+++ b/net/rds/ib_recv.c
@@ -194,6 +194,8 @@ static void rds_ib_frag_free(struct rds_ib_connection *ic,
rdsdebug("frag %p page %p\n", frag, sg_page(>f_sg));
 
rds_ib_recv_cache_put(&frag->f_cache_entry, &ic->i_cache_frags);
+   atomic_add(RDS_FRAG_SIZE / SZ_1K, &ic->i_cache_allocs);
+   rds_ib_stats_add(s_ib_recv_added_to_cache, RDS_FRAG_SIZE);
 }
 
 /* Recycle inc after freeing attached frags */
@@ -261,6 +263,7 @@ static struct rds_ib_incoming *rds_ib_refill_one_inc(struct 
rds_ib_connection *i
atomic_dec(_ib_allocation);
return NULL;
}
+   rds_ib_stats_inc(s_ib_rx_total_incs);
}
INIT_LIST_HEAD(&ibinc->ii_frags);
rds_inc_init(&ibinc->ii_inc, ic->conn, ic->conn->c_faddr);
@@ -278,6 +281,8 @@ static struct rds_page_frag *rds_ib_refill_one_frag(struct 
rds_ib_connection *ic
cache_item = rds_ib_recv_cache_get(&ic->i_cache_frags);
if (cache_item) {
frag = container_of(cache_item, struct rds_page_frag, f_cache_entry);
+   atomic_sub(RDS_FRAG_SIZE / SZ_1K, &ic->i_cache_allocs);
+   rds_ib_stats_add(s_ib_recv_removed_from_cache, RDS_FRAG_SIZE);
} else {
frag = kmem_cache_alloc(rds_ib_frag_slab, slab_mask);
if (!frag)
@@ -290,6 +295,7 @@ static struct rds_page_frag *rds_ib_refill_one_frag(struct 
rds_ib_connection *ic
kmem_cache_free(rds_ib_frag_slab, frag);
return NULL;
}
+   rds_ib_stats_inc(s_ib_rx_total_frags);
}
 
INIT_LIST_HEAD(&frag->f_item);
diff --git a/net/rds/ib_stats.c b/net/rds/ib_stats.c
index 7e78dca..9252ad1 100644
--- a/net/rds/ib_stats.c
+++ b/net/rds/ib_stats.c
@@ -55,6 +55,8 @@
"ib_rx_refill_from_cq",
"ib_rx_refill_from_thread",
"ib_rx_alloc_limit",
+   "ib_rx_total_frags",
+   "ib_rx_total_incs",
"ib_rx_credit_updates",
"ib_ack_sent",
"ib_ack_send_failure",
-- 
1.9.1



[net-next][PATCH v2 09/18] RDS: RDMA: silence the use_once mr log flood

2016-12-06 Thread Santosh Shilimkar
In the absence of extension headers, the message log will keep
flooding the console. Even without use_once we can clean up the
MRs, so it's not really an error case; make it a debug message.

Signed-off-by: Santosh Shilimkar 
---
 net/rds/rdma.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/rds/rdma.c b/net/rds/rdma.c
index 4c93bad..8151c49 100644
--- a/net/rds/rdma.c
+++ b/net/rds/rdma.c
@@ -415,7 +415,8 @@ void rds_rdma_unuse(struct rds_sock *rs, u32 r_key, int 
force)
spin_lock_irqsave(&rs->rs_rdma_lock, flags);
mr = rds_mr_tree_walk(&rs->rs_rdma_keys, r_key, NULL);
if (!mr) {
-   printk(KERN_ERR "rds: trying to unuse MR with unknown r_key %u!\n", r_key);
+   pr_debug("rds: trying to unuse MR with unknown r_key %u!\n",
+r_key);
spin_unlock_irqrestore(&rs->rs_rdma_lock, flags);
return;
}
-- 
1.9.1



[net-next][PATCH v2 01/18] RDS: log the address on bind failure

2016-12-06 Thread Santosh Shilimkar
It's useful to know the IP address when RDS fails to bind a
connection. Thus, adding it to the error message.

Orabug: 21894138
Reviewed-by: Wei Lin Guay 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/bind.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/rds/bind.c b/net/rds/bind.c
index 095f6ce..3a915be 100644
--- a/net/rds/bind.c
+++ b/net/rds/bind.c
@@ -176,8 +176,8 @@ int rds_bind(struct socket *sock, struct sockaddr *uaddr, 
int addr_len)
if (!trans) {
ret = -EADDRNOTAVAIL;
rds_remove_bound(rs);
-   printk_ratelimited(KERN_INFO "RDS: rds_bind() could not find a transport, "
-   "load rds_tcp or rds_rdma?\n");
+   pr_info_ratelimited("RDS: %s could not find a transport for %pI4, load rds_tcp or rds_rdma?\n",
+   __func__, &sin->sin_addr.s_addr);
goto out;
}
 
-- 
1.9.1



[net-next][PATCH v2 18/18] RDS: IB: add missing connection cache usage info

2016-12-06 Thread Santosh Shilimkar
rds-tools already supports it.

Signed-off-by: Santosh Shilimkar 
---
 include/uapi/linux/rds.h | 1 +
 net/rds/ib.c | 1 +
 2 files changed, 2 insertions(+)

diff --git a/include/uapi/linux/rds.h b/include/uapi/linux/rds.h
index 3833113..410ae3c 100644
--- a/include/uapi/linux/rds.h
+++ b/include/uapi/linux/rds.h
@@ -183,6 +183,7 @@ struct rds_info_rdma_connection {
uint32_tmax_send_sge;
uint32_trdma_mr_max;
uint32_trdma_mr_size;
+   uint32_tcache_allocs;
 };
 
 /* RDS message Receive Path Latency points */
diff --git a/net/rds/ib.c b/net/rds/ib.c
index 8d70884..b5e2699 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -313,6 +313,7 @@ static int rds_ib_conn_info_visitor(struct rds_connection 
*conn,
iinfo->max_send_wr = ic->i_send_ring.w_nr;
iinfo->max_recv_wr = ic->i_recv_ring.w_nr;
iinfo->max_send_sge = rds_ibdev->max_sge;
+   iinfo->cache_allocs = atomic_read(&ic->i_cache_allocs);
rds_ib_get_mr_info(rds_ibdev, iinfo);
}
return 1;
-- 
1.9.1



[net-next][PATCH v2 15/18] RDS: add stat for socket recv memory usage

2016-12-06 Thread Santosh Shilimkar
From: Venkat Venkatsubra 

Tracks the receive side memory added to sockets and removed from sockets.

Signed-off-by: Venkat Venkatsubra 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/rds.h  | 3 +++
 net/rds/recv.c | 4 
 2 files changed, 7 insertions(+)

diff --git a/net/rds/rds.h b/net/rds/rds.h
index 0bb8213..8ccd5a9 100644
--- a/net/rds/rds.h
+++ b/net/rds/rds.h
@@ -631,6 +631,9 @@ struct rds_statistics {
uint64_ts_cong_update_received;
uint64_ts_cong_send_error;
uint64_ts_cong_send_blocked;
+   uint64_ts_recv_bytes_added_to_socket;
+   uint64_ts_recv_bytes_removed_from_socket;
+
 };
 
 /* af_rds.c */
diff --git a/net/rds/recv.c b/net/rds/recv.c
index 9d0666e..ba19eee 100644
--- a/net/rds/recv.c
+++ b/net/rds/recv.c
@@ -94,6 +94,10 @@ static void rds_recv_rcvbuf_delta(struct rds_sock *rs, 
struct sock *sk,
return;
 
rs->rs_rcv_bytes += delta;
+   if (delta > 0)
+   rds_stats_add(s_recv_bytes_added_to_socket, delta);
+   else
+   rds_stats_add(s_recv_bytes_removed_from_socket, -delta);
now_congested = rs->rs_rcv_bytes > rds_sk_rcvbuf(rs);
 
rdsdebug("rs %p (%pI4:%u) recv bytes %d buf %d "
-- 
1.9.1



linux-next: manual merge of the staging tree with the net-next tree

2016-12-06 Thread Stephen Rothwell
Hi Greg,

Today's linux-next merge of the staging tree got a conflict in:

  drivers/staging/slicoss/slicoss.c

between commit:

  a52ad514fdf3 ("net: deprecate eth_change_mtu, remove usage")

from the net-next tree and commit:

  0af72df267f2 ("staging: slicoss: remove the staging driver")

from the staging tree.

I fixed it up (I just removed the file) and can carry the fix as
necessary. This is now fixed as far as linux-next is concerned, but any
non trivial conflicts should be mentioned to your upstream maintainer
when your tree is submitted for merging.  You may also want to consider
cooperating with the maintainer of the conflicting tree to minimise any
particularly complex conflicts.

-- 
Cheers,
Stephen Rothwell


[net-next][PATCH v2 17/18] RDS: add receive message trace used by application

2016-12-06 Thread Santosh Shilimkar
Socket option to tap receive path latency at various stages,
in nanoseconds. It can be enabled on selected sockets using
the SO_RDS_MSG_RXPATH_LATENCY socket option. RDS will return
the data to the application with RDS_CMSG_RXPATH_LATENCY in a
defined format. Scope is left to add more trace points in the
future without needing a change in the interface.

Reviewed-by: Sowmini Varadhan 
Signed-off-by: Santosh Shilimkar 
---
 include/uapi/linux/rds.h | 33 +
 net/rds/af_rds.c | 28 
 net/rds/ib_recv.c|  4 
 net/rds/rds.h| 10 ++
 net/rds/recv.c   | 32 +---
 net/rds/tcp_recv.c   |  5 +
 6 files changed, 109 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/rds.h b/include/uapi/linux/rds.h
index 0f9265c..3833113 100644
--- a/include/uapi/linux/rds.h
+++ b/include/uapi/linux/rds.h
@@ -52,6 +52,13 @@
 #define RDS_GET_MR_FOR_DEST7
 #define SO_RDS_TRANSPORT   8
 
+/* Socket option to tap receive path latency
+ * SO_RDS: SO_RDS_MSG_RXPATH_LATENCY
+ * Format used struct rds_rx_trace_so
+ */
+#define SO_RDS_MSG_RXPATH_LATENCY  10
+
+
 /* supported values for SO_RDS_TRANSPORT */
 #defineRDS_TRANS_IB0
 #defineRDS_TRANS_IWARP 1
@@ -77,6 +84,12 @@
  * the same as for the GET_MR setsockopt.
  * RDS_CMSG_RDMA_STATUS (recvmsg)
  * Returns the status of a completed RDMA operation.
+ * RDS_CMSG_RXPATH_LATENCY(recvmsg)
+ * Returns rds message latencies in various stages of receive
+ * path in nS. Its set per socket using SO_RDS_MSG_RXPATH_LATENCY
+ * socket option. Legitimate points are defined in
+ * enum rds_message_rxpath_latency. More points can be added in
+ * future. CMSG format is struct rds_cmsg_rx_trace.
  */
 #define RDS_CMSG_RDMA_ARGS 1
 #define RDS_CMSG_RDMA_DEST 2
@@ -87,6 +100,7 @@
 #define RDS_CMSG_ATOMIC_CSWP   7
 #define RDS_CMSG_MASKED_ATOMIC_FADD8
 #define RDS_CMSG_MASKED_ATOMIC_CSWP9
+#define RDS_CMSG_RXPATH_LATENCY11
 
 #define RDS_INFO_FIRST 1
 #define RDS_INFO_COUNTERS  1
@@ -171,6 +185,25 @@ struct rds_info_rdma_connection {
uint32_trdma_mr_size;
 };
 
+/* RDS message Receive Path Latency points */
+enum rds_message_rxpath_latency {
+   RDS_MSG_RX_HDR_TO_DGRAM_START = 0,
+   RDS_MSG_RX_DGRAM_REASSEMBLE,
+   RDS_MSG_RX_DGRAM_DELIVERED,
+   RDS_MSG_RX_DGRAM_TRACE_MAX
+};
+
+struct rds_rx_trace_so {
+   u8 rx_traces;
+   u8 rx_trace_pos[RDS_MSG_RX_DGRAM_TRACE_MAX];
+};
+
+struct rds_cmsg_rx_trace {
+   u8 rx_traces;
+   u8 rx_trace_pos[RDS_MSG_RX_DGRAM_TRACE_MAX];
+   u64 rx_trace[RDS_MSG_RX_DGRAM_TRACE_MAX];
+};
+
 /*
  * Congestion monitoring.
  * Congestion control in RDS happens at the host connection
diff --git a/net/rds/af_rds.c b/net/rds/af_rds.c
index 2ac1e61..fd821740 100644
--- a/net/rds/af_rds.c
+++ b/net/rds/af_rds.c
@@ -298,6 +298,30 @@ static int rds_enable_recvtstamp(struct sock *sk, char 
__user *optval,
return 0;
 }
 
+static int rds_recv_track_latency(struct rds_sock *rs, char __user *optval,
+ int optlen)
+{
+   struct rds_rx_trace_so trace;
+   int i;
+
+   if (optlen != sizeof(struct rds_rx_trace_so))
+   return -EFAULT;
+
+   if (copy_from_user(&trace, optval, sizeof(trace)))
+   return -EFAULT;
+
+   rs->rs_rx_traces = trace.rx_traces;
+   for (i = 0; i < rs->rs_rx_traces; i++) {
+   if (trace.rx_trace_pos[i] > RDS_MSG_RX_DGRAM_TRACE_MAX) {
+   rs->rs_rx_traces = 0;
+   return -EFAULT;
+   }
+   rs->rs_rx_trace[i] = trace.rx_trace_pos[i];
+   }
+
+   return 0;
+}
+
 static int rds_setsockopt(struct socket *sock, int level, int optname,
  char __user *optval, unsigned int optlen)
 {
@@ -338,6 +362,9 @@ static int rds_setsockopt(struct socket *sock, int level, 
int optname,
ret = rds_enable_recvtstamp(sock->sk, optval, optlen);
release_sock(sock->sk);
break;
+   case SO_RDS_MSG_RXPATH_LATENCY:
+   ret = rds_recv_track_latency(rs, optval, optlen);
+   break;
default:
ret = -ENOPROTOOPT;
}
@@ -484,6 +511,7 @@ static int __rds_create(struct socket *sock, struct sock 
*sk, int protocol)
INIT_LIST_HEAD(&rs->rs_cong_list);
spin_lock_init(&rs->rs_rdma_lock);
rs->rs_rdma_keys = RB_ROOT;
+   rs->rs_rx_traces = 0;
 
spin_lock_bh(&rds_sock_lock);
list_add_tail(&rs->rs_item, &rds_sock_list);
diff --git a/net/rds/ib_recv.c b/net/rds/ib_recv.c
index 4b0f126..e10624a 100644
--- a/net/rds/ib_recv.c
+++ b/net/rds/ib_recv.c
@@ -911,8 +911,12 @@ static void 

[net-next][PATCH v2 06/18] RDS: RDMA: start rdma listening after init

2016-12-06 Thread Santosh Shilimkar
From: Qing Huang 

This prevents RDS from handling incoming rdma packets before RDS
completes initializing its recv/send components.

Signed-off-by: Qing Huang 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/rdma_transport.c | 11 +++
 1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/net/rds/rdma_transport.c b/net/rds/rdma_transport.c
index 345f090..250c6b8 100644
--- a/net/rds/rdma_transport.c
+++ b/net/rds/rdma_transport.c
@@ -203,18 +203,13 @@ static int rds_rdma_init(void)
 {
int ret;
 
-   ret = rds_rdma_listen_init();
+   ret = rds_ib_init();
if (ret)
goto out;
 
-   ret = rds_ib_init();
+   ret = rds_rdma_listen_init();
if (ret)
-   goto err_ib_init;
-
-   goto out;
-
-err_ib_init:
-   rds_rdma_listen_stop();
+   rds_ib_exit();
 out:
return ret;
 }
-- 
1.9.1



[net-next][PATCH v2 03/18] RDS: IB: include faddr in connection log

2016-12-06 Thread Santosh Shilimkar
Also use pr_* for it.

Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib_cm.c   | 19 +--
 net/rds/ib_recv.c |  4 ++--
 net/rds/ib_send.c |  4 ++--
 3 files changed, 13 insertions(+), 14 deletions(-)

diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index 5b2ab95..b9da1e5 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -113,19 +113,18 @@ void rds_ib_cm_connect_complete(struct rds_connection 
*conn, struct rdma_cm_even
}
 
if (conn->c_version < RDS_PROTOCOL(3, 1)) {
-   printk(KERN_NOTICE "RDS/IB: Connection to %pI4 version %u.%u failed,"
-  " no longer supported\n",
-  &conn->c_faddr,
-  RDS_PROTOCOL_MAJOR(conn->c_version),
-  RDS_PROTOCOL_MINOR(conn->c_version));
+   pr_notice("RDS/IB: Connection <%pI4,%pI4> version %u.%u no longer supported\n",
+ &conn->c_laddr, &conn->c_faddr,
+ RDS_PROTOCOL_MAJOR(conn->c_version),
+ RDS_PROTOCOL_MINOR(conn->c_version));
rds_conn_destroy(conn);
return;
} else {
-   printk(KERN_NOTICE "RDS/IB: connected to %pI4 version %u.%u%s\n",
-  &conn->c_faddr,
-  RDS_PROTOCOL_MAJOR(conn->c_version),
-  RDS_PROTOCOL_MINOR(conn->c_version),
-  ic->i_flowctl ? ", flow control" : "");
+   pr_notice("RDS/IB: connected <%pI4,%pI4> version %u.%u%s\n",
+ &conn->c_laddr, &conn->c_faddr,
+ RDS_PROTOCOL_MAJOR(conn->c_version),
+ RDS_PROTOCOL_MINOR(conn->c_version),
+ ic->i_flowctl ? ", flow control" : "");
}
 
/*
diff --git a/net/rds/ib_recv.c b/net/rds/ib_recv.c
index 606a11f..6803b75 100644
--- a/net/rds/ib_recv.c
+++ b/net/rds/ib_recv.c
@@ -980,8 +980,8 @@ void rds_ib_recv_cqe_handler(struct rds_ib_connection *ic,
} else {
/* We expect errors as the qp is drained during shutdown */
if (rds_conn_up(conn) || rds_conn_connecting(conn))
-   rds_ib_conn_error(conn, "recv completion on %pI4 had status %u (%s), disconnecting and reconnecting\n",
- &conn->c_faddr,
+   rds_ib_conn_error(conn, "recv completion on <%pI4,%pI4> had status %u (%s), disconnecting and reconnecting\n",
+ &conn->c_laddr, &conn->c_faddr,
  wc->status,
  ib_wc_status_msg(wc->status));
}
diff --git a/net/rds/ib_send.c b/net/rds/ib_send.c
index 84d90c9..19eca5c 100644
--- a/net/rds/ib_send.c
+++ b/net/rds/ib_send.c
@@ -300,8 +300,8 @@ void rds_ib_send_cqe_handler(struct rds_ib_connection *ic, 
struct ib_wc *wc)
 
/* We expect errors as the qp is drained during shutdown */
if (wc->status != IB_WC_SUCCESS && rds_conn_up(conn)) {
-   rds_ib_conn_error(conn, "send completion on %pI4 had status %u (%s), disconnecting and reconnecting\n",
- &conn->c_faddr, wc->status,
+   rds_ib_conn_error(conn, "send completion on <%pI4,%pI4> had status %u (%s), disconnecting and reconnecting\n",
+ &conn->c_laddr, &conn->c_faddr, wc->status,
  ib_wc_status_msg(wc->status));
}
 }
-- 
1.9.1



[net-next][PATCH v2 07/18] RDS: RDMA: return appropriate error on rdma map failures

2016-12-06 Thread Santosh Shilimkar
The first message to a remote node should prompt a new
connection even if it is an RDMA operation. For an RDMA operation
the MR mapping can fail because the connection is not yet up.

Since connection establishment is asynchronous,
we make sure a map failure due to an unavailable
connection reaches the user with an appropriate error code.
Before returning to the user, trigger the connection
so that it's ready for the next retry.

Signed-off-by: Santosh Shilimkar 
---
 net/rds/send.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/net/rds/send.c b/net/rds/send.c
index bb13c56..0a6f38b 100644
--- a/net/rds/send.c
+++ b/net/rds/send.c
@@ -945,6 +945,11 @@ static int rds_cmsg_send(struct rds_sock *rs, struct 
rds_message *rm,
ret = rds_cmsg_rdma_map(rs, rm, cmsg);
if (!ret)
*allocated_mr = 1;
+   else if (ret == -ENODEV)
+   /* Accommodate the get_mr() case which can fail
+* if connection isn't established yet.
+*/
+   ret = -EAGAIN;
break;
case RDS_CMSG_ATOMIC_CSWP:
case RDS_CMSG_ATOMIC_FADD:
@@ -1082,8 +1087,12 @@ int rds_sendmsg(struct socket *sock, struct msghdr *msg, 
size_t payload_len)
 
/* Parse any control messages the user may have included. */
ret = rds_cmsg_send(rs, rm, msg, &allocated_mr);
-   if (ret)
+   if (ret) {
+   /* Trigger connection so that its ready for the next retry */
+   if (ret == -EAGAIN)
+   rds_conn_connect_if_down(conn);
goto out;
+   }
 
if (rm->rdma.op_active && !conn->c_trans->xmit_rdma) {
printk_ratelimited(KERN_NOTICE "rdma_op %p conn xmit_rdma %p\n",
-- 
1.9.1
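Conceptually, the error-code remapping and connection kick added by this patch can be modeled in plain user-space C. The function names below are illustrative stand-ins for the kernel paths, not the actual kernel symbols:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Set when the (asynchronous) connection attempt is kicked off;
 * stands in for rds_conn_connect_if_down().
 */
static bool conn_trigger_requested;

/* Models rds_cmsg_rdma_map() failing because the connection is down. */
static int cmsg_rdma_map(bool conn_up)
{
	return conn_up ? 0 : -ENODEV;
}

/* Models the rds_cmsg_send() hunk: remap -ENODEV to -EAGAIN so the
 * caller knows a retry may succeed once the connection comes up.
 */
static int cmsg_send(bool conn_up)
{
	int ret = cmsg_rdma_map(conn_up);

	if (ret == -ENODEV)
		ret = -EAGAIN;
	return ret;
}

/* Models the rds_sendmsg() hunk: on -EAGAIN, trigger the connection
 * before returning the error to user space.
 */
static int do_sendmsg(bool conn_up)
{
	int ret = cmsg_send(conn_up);

	if (ret == -EAGAIN)
		conn_trigger_requested = true;
	return ret;
}
```

The first send on a down connection returns -EAGAIN and requests a connect; once the connection is up, the same call succeeds.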



[net-next][PATCH v2 00/18] net: RDS updates

2016-12-06 Thread Santosh Shilimkar
v1->v2:
Re-aligned indentation in patch 'RDS: mark few internal functions static to make sparse build happy'.

The series consists of:
 - RDMA transport fixes for map failure, listen sequence, handler panic and
   composite message notification.
 - A couple of sparse fixes.
 - Message logging improvements for bind failure, use-once MR semantics,
   connection remote address and active endpoint.
 - Performance improvements for the RDMA transport by reducing the post-send
   pressure on the queue and spreading the CQ vectors.
 - Useful statistics for socket send/recv usage and receive cache usage.
   rds-tools is already equipped to parse this info.
 - An additional RDS CMSG used by applications to track the RDS message
   stages for certain types of traffic, to find latency spots.
   Can be enabled/disabled per socket.

The series was generated against 'net-next'. The full patchset is also
available in the git tree below.

The following changes since commit adc176c5472214971d77c1a61c83db9b01e9cdc7:

  ipv6 addrconf: Implemented enhanced DAD (RFC7527) (2016-12-03 23:21:37 -0500)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/ssantosh/linux.git 
for_4.10/net-next/rds_v2

for you to fetch changes up to ccb54a64f5e73a7c3ea9f74effdac0bdd1f27ac4:

  RDS: IB: add missing connection cache usage info (2016-12-05 17:20:27 -0800)


Avinash Repaka (1):
  RDS: make message size limit compliant with spec

Qing Huang (1):
  RDS: RDMA: start rdma listening after init

Santosh Shilimkar (15):
  RDS: log the address on bind failure
  RDS: mark few internal functions static to make sparse build happy
  RDS: IB: include faddr in connection log
  RDS: IB: make the transport retry count smallest
  RDS: RDMA: fix the ib_map_mr_sg_zbva() argument
  RDS: RDMA: return appropriate error on rdma map failures
  RDS: IB: split the mr registration and invalidation path
  RDS: RDMA: silence the use_once mr log flood
  RDS: IB: track and log active side endpoint in connection
  RDS: IB: add few useful cache stats
  RDS: IB: Add vector spreading for cqs
  RDS: RDMA: Fix the composite message user notification
  RDS: IB: fix panic due to handlers running post teardown
  RDS: add receive message trace used by application
  RDS: IB: add missing connection cache usage info

Venkat Venkatsubra (1):
  RDS: add stat for socket recv memory usage

 include/uapi/linux/rds.h | 34 ++
 net/rds/af_rds.c | 28 +++
 net/rds/bind.c   |  4 +--
 net/rds/connection.c | 10 +++---
 net/rds/ib.c | 12 +++
 net/rds/ib.h | 22 ++--
 net/rds/ib_cm.c  | 89 ++--
 net/rds/ib_frmr.c| 16 +
 net/rds/ib_recv.c| 14 ++--
 net/rds/ib_send.c| 29 +---
 net/rds/ib_stats.c   |  2 ++
 net/rds/rdma.c   | 22 ++--
 net/rds/rdma_transport.c | 11 ++
 net/rds/rds.h| 17 +
 net/rds/recv.c   | 36 ++--
 net/rds/send.c   | 50 ---
 net/rds/tcp_listen.c |  1 +
 net/rds/tcp_recv.c   |  5 +++
 18 files changed, 337 insertions(+), 65 deletions(-)

-- 
1.9.1



[net-next][PATCH v2 02/18] RDS: mark few internal functions static to make sparse build happy

2016-12-06 Thread Santosh Shilimkar
Fixes below warnings:
warning: symbol 'rds_send_probe' was not declared. Should it be static?
warning: symbol 'rds_send_ping' was not declared. Should it be static?
warning: symbol 'rds_tcp_accept_one_path' was not declared. Should it be static?
warning: symbol 'rds_walk_conn_path_info' was not declared. Should it be static?

Signed-off-by: Santosh Shilimkar 
---
 net/rds/connection.c | 10 +-
 net/rds/send.c   |  4 ++--
 net/rds/tcp_listen.c |  1 +
 3 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/net/rds/connection.c b/net/rds/connection.c
index fe9d31c..0e04dcc 100644
--- a/net/rds/connection.c
+++ b/net/rds/connection.c
@@ -545,11 +545,11 @@ void rds_for_each_conn_info(struct socket *sock, unsigned int len,
 }
 EXPORT_SYMBOL_GPL(rds_for_each_conn_info);
 
-void rds_walk_conn_path_info(struct socket *sock, unsigned int len,
-struct rds_info_iterator *iter,
-struct rds_info_lengths *lens,
-int (*visitor)(struct rds_conn_path *, void *),
-size_t item_len)
+static void rds_walk_conn_path_info(struct socket *sock, unsigned int len,
+   struct rds_info_iterator *iter,
+   struct rds_info_lengths *lens,
+   int (*visitor)(struct rds_conn_path *, void *),
+   size_t item_len)
 {
u64  buffer[(item_len + 7) / 8];
struct hlist_head *head;
diff --git a/net/rds/send.c b/net/rds/send.c
index 77c8c6e..bb13c56 100644
--- a/net/rds/send.c
+++ b/net/rds/send.c
@@ -1169,7 +1169,7 @@ int rds_sendmsg(struct socket *sock, struct msghdr *msg, size_t payload_len)
  * or
  *   RDS_FLAG_HB_PONG|RDS_FLAG_ACK_REQUIRED
  */
-int
+static int
 rds_send_probe(struct rds_conn_path *cp, __be16 sport,
   __be16 dport, u8 h_flags)
 {
@@ -1238,7 +1238,7 @@ int rds_sendmsg(struct socket *sock, struct msghdr *msg, size_t payload_len)
return rds_send_probe(cp, 0, dport, 0);
 }
 
-void
+static void
 rds_send_ping(struct rds_connection *conn)
 {
unsigned long flags;
diff --git a/net/rds/tcp_listen.c b/net/rds/tcp_listen.c
index f74bab3..67d0929 100644
--- a/net/rds/tcp_listen.c
+++ b/net/rds/tcp_listen.c
@@ -79,6 +79,7 @@ int rds_tcp_keepalive(struct socket *sock)
  * smaller ip address, we recycle conns in RDS_CONN_ERROR on the passive side
  * by moving them to CONNECTING in this function.
  */
+static
 struct rds_tcp_connection *rds_tcp_accept_one_path(struct rds_connection *conn)
 {
int i;
-- 
1.9.1



[net-next][PATCH v2 05/18] RDS: RDMA: fix the ib_map_mr_sg_zbva() argument

2016-12-06 Thread Santosh Shilimkar
Fixes warning: Using plain integer as NULL pointer

Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib_frmr.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/rds/ib_frmr.c b/net/rds/ib_frmr.c
index d921adc..66b3d62 100644
--- a/net/rds/ib_frmr.c
+++ b/net/rds/ib_frmr.c
@@ -104,14 +104,15 @@ static int rds_ib_post_reg_frmr(struct rds_ib_mr *ibmr)
struct rds_ib_frmr *frmr = &ibmr->u.frmr;
struct ib_send_wr *failed_wr;
struct ib_reg_wr reg_wr;
-   int ret;
+   int ret, off = 0;
 
while (atomic_dec_return(&ibmr->ic->i_fastreg_wrs) <= 0) {
atomic_inc(&ibmr->ic->i_fastreg_wrs);
cpu_relax();
}
 
-   ret = ib_map_mr_sg_zbva(frmr->mr, ibmr->sg, ibmr->sg_len, 0, PAGE_SIZE);
+   ret = ib_map_mr_sg_zbva(frmr->mr, ibmr->sg, ibmr->sg_len,
+   &off, PAGE_SIZE);
if (unlikely(ret != ibmr->sg_len))
return ret < 0 ? ret : -EINVAL;
 
-- 
1.9.1



[PATCH net-next 0/2] Initial driver for Synopsys DWC XLGMAC

2016-12-06 Thread Jie Deng
This series provides support for 25/40/50/100 GbE
devices using the Synopsys DWC Enterprise Ethernet core (XLGMAC).

The first patch adds support for Synopsys XLGMII.
The second patch provides the initial driver for the Synopsys XLGMAC.

The driver has three layers, obtained by refactoring AMD XGBE.

dwc-eth-xxx.x
  The DWC Ethernet core layer (DWC ECL). This layer contains code
that can be shared by different DWC-series Ethernet cores.

dwc-xxx.x (e.g. dwc-xlgmac.c)
  The DWC MAC HW adapter layer (DWC MHAL). This layer contains
support specific to one MAC; currently, XLGMAC.

xxx-xxx-pci.c xxx-xxx-plat.c (e.g. dwc-xlgmac-pci.c)
  The glue adapter layer (GAL). Vendors who adopt Synopsys Ethernet
cores can develop a glue driver for their platform.

Jie Deng (2):
  net: phy: add extension of phy-mode for XLGMII
  net: ethernet: Initial driver for Synopsys DWC XLGMAC

 Documentation/devicetree/bindings/net/ethernet.txt |1 +
 MAINTAINERS|6 +
 drivers/net/ethernet/synopsys/Kconfig  |2 +
 drivers/net/ethernet/synopsys/Makefile |1 +
 drivers/net/ethernet/synopsys/dwc/Kconfig  |   37 +
 drivers/net/ethernet/synopsys/dwc/Makefile |9 +
 drivers/net/ethernet/synopsys/dwc/dwc-eth-dcb.c|  228 ++
 .../net/ethernet/synopsys/dwc/dwc-eth-debugfs.c|  328 +++
 drivers/net/ethernet/synopsys/dwc/dwc-eth-desc.c   |  715 +
 .../net/ethernet/synopsys/dwc/dwc-eth-ethtool.c|  567 
 drivers/net/ethernet/synopsys/dwc/dwc-eth-hw.c | 3098 
 drivers/net/ethernet/synopsys/dwc/dwc-eth-mdio.c   |  252 ++
 drivers/net/ethernet/synopsys/dwc/dwc-eth-net.c| 2319 +++
 drivers/net/ethernet/synopsys/dwc/dwc-eth-ptp.c|  216 ++
 drivers/net/ethernet/synopsys/dwc/dwc-eth-regacc.h | 1115 +++
 drivers/net/ethernet/synopsys/dwc/dwc-eth.h|  738 +
 drivers/net/ethernet/synopsys/dwc/dwc-xlgmac-pci.c |  538 
 drivers/net/ethernet/synopsys/dwc/dwc-xlgmac.c |  135 +
 drivers/net/ethernet/synopsys/dwc/dwc-xlgmac.h |   85 +
 include/linux/phy.h|3 +
 20 files changed, 10393 insertions(+)
 create mode 100644 drivers/net/ethernet/synopsys/dwc/Kconfig
 create mode 100644 drivers/net/ethernet/synopsys/dwc/Makefile
 create mode 100644 drivers/net/ethernet/synopsys/dwc/dwc-eth-dcb.c
 create mode 100644 drivers/net/ethernet/synopsys/dwc/dwc-eth-debugfs.c
 create mode 100644 drivers/net/ethernet/synopsys/dwc/dwc-eth-desc.c
 create mode 100644 drivers/net/ethernet/synopsys/dwc/dwc-eth-ethtool.c
 create mode 100644 drivers/net/ethernet/synopsys/dwc/dwc-eth-hw.c
 create mode 100644 drivers/net/ethernet/synopsys/dwc/dwc-eth-mdio.c
 create mode 100644 drivers/net/ethernet/synopsys/dwc/dwc-eth-net.c
 create mode 100644 drivers/net/ethernet/synopsys/dwc/dwc-eth-ptp.c
 create mode 100644 drivers/net/ethernet/synopsys/dwc/dwc-eth-regacc.h
 create mode 100644 drivers/net/ethernet/synopsys/dwc/dwc-eth.h
 create mode 100644 drivers/net/ethernet/synopsys/dwc/dwc-xlgmac-pci.c
 create mode 100644 drivers/net/ethernet/synopsys/dwc/dwc-xlgmac.c
 create mode 100644 drivers/net/ethernet/synopsys/dwc/dwc-xlgmac.h

-- 
1.9.1



[PATCH net-next 1/2] net: phy: add extension of phy-mode for XLGMII

2016-12-06 Thread Jie Deng
This patch adds phy-mode support for Synopsys XLGMAC

Signed-off-by: Jie Deng 
---
 Documentation/devicetree/bindings/net/ethernet.txt | 1 +
 include/linux/phy.h| 3 +++
 2 files changed, 4 insertions(+)

diff --git a/Documentation/devicetree/bindings/net/ethernet.txt b/Documentation/devicetree/bindings/net/ethernet.txt
index 0515095..2378f00 100644
--- a/Documentation/devicetree/bindings/net/ethernet.txt
+++ b/Documentation/devicetree/bindings/net/ethernet.txt
@@ -28,6 +28,7 @@ The following properties are common to the Ethernet controllers:
   * "rtbi"
   * "smii"
   * "xgmii"
+  * "xlgmii"
   * "trgmii"
 - phy-connection-type: the same as "phy-mode" property but described in ePAPR;
 - phy-handle: phandle, specifies a reference to a node representing a PHY
diff --git a/include/linux/phy.h b/include/linux/phy.h
index feb8a98..b52f9f8 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -79,6 +79,7 @@
PHY_INTERFACE_MODE_RTBI,
PHY_INTERFACE_MODE_SMII,
PHY_INTERFACE_MODE_XGMII,
+   PHY_INTERFACE_MODE_XLGMII,
PHY_INTERFACE_MODE_MOCA,
PHY_INTERFACE_MODE_QSGMII,
PHY_INTERFACE_MODE_TRGMII,
@@ -136,6 +137,8 @@ static inline const char *phy_modes(phy_interface_t interface)
return "smii";
case PHY_INTERFACE_MODE_XGMII:
return "xgmii";
+   case PHY_INTERFACE_MODE_XLGMII:
+   return "xlgmii";
case PHY_INTERFACE_MODE_MOCA:
return "moca";
case PHY_INTERFACE_MODE_QSGMII:
-- 
1.9.1



Re: [PATCH v2] tun: Use netif_receive_skb instead of netif_rx

2016-12-06 Thread Jason Wang



On December 7, 2016 at 11:25, David Miller wrote:

From: Jason Wang 
Date: Wed, 7 Dec 2016 11:21:11 +0800


David, looks like this commit is not in net-next.git.

Please help to check.

Take a look, it should be there now.


Yes, thanks.


[PATCH net-next] net: sock_rps_record_flow() is for connected sockets

2016-12-06 Thread Eric Dumazet
From: Eric Dumazet 

Paolo noticed a cache line miss in UDP recvmsg() to access
sk_rxhash, sharing a cache line with sk_drops.

sk_drops might be heavily incremented by cpus handling a flood targeting
this socket.

We might place sk_drops on a separate cache line, but let's try
to avoid wasting 64 bytes per socket just for this, since we have
other bottlenecks to take care of.

sock_rps_record_flow() should only access sk_rxhash for connected
flows.

Testing sk_state for TCP_ESTABLISHED covers most of the cases for
connected sockets, for a zero cost, since system calls using
sock_rps_record_flow() also access sk->sk_prot which is on the
same cache line.

A follow up patch will provide a static_key (Jump Label) since most
hosts do not even use RFS.

Signed-off-by: Eric Dumazet 
Reported-by: Paolo Abeni 
---
 include/net/sock.h |   12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 6dfe3aa22b970eecfab4d4a0753804b1cc82a200..a7ddab993b496f1f4060f0b41831a161c284df9e 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -913,7 +913,17 @@ static inline void sock_rps_record_flow_hash(__u32 hash)
 static inline void sock_rps_record_flow(const struct sock *sk)
 {
 #ifdef CONFIG_RPS
-   sock_rps_record_flow_hash(sk->sk_rxhash);
+   /* Reading sk->sk_rxhash might incur an expensive cache line miss.
+*
+* TCP_ESTABLISHED does cover almost all states where RFS
+* might be useful, and is cheaper [1] than testing :
+*  IPv4: inet_sk(sk)->inet_daddr
+*  IPv6: ipv6_addr_any(&sk->sk_v6_daddr)
+* OR   an additional socket flag
+* [1] : sk_state and sk_prot are in the same cache line.
+*/
+   if (sk->sk_state == TCP_ESTABLISHED)
+   sock_rps_record_flow_hash(sk->sk_rxhash);
 #endif
 }
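The effect of this gate can be illustrated with a small user-space model. The struct and constants below are simplified stand-ins for the kernel definitions, not the real ones:

```c
#include <assert.h>

#define TCP_ESTABLISHED 1
#define TCP_CLOSE       7

/* Simplified stand-in for struct sock: sk_state sits on a hot cache
 * line (shared with sk_prot), while sk_rxhash may be cold because it
 * shares a line with the frequently written sk_drops.
 */
struct sock {
	int sk_state;
	unsigned int sk_rxhash;
};

static unsigned int last_recorded_hash;
static int rxhash_reads; /* counts potentially expensive sk_rxhash accesses */

static void sock_rps_record_flow_hash(unsigned int hash)
{
	last_recorded_hash = hash;
}

/* After the patch: only connected sockets touch sk_rxhash. */
static void sock_rps_record_flow(const struct sock *sk)
{
	if (sk->sk_state == TCP_ESTABLISHED) {
		rxhash_reads++;
		sock_rps_record_flow_hash(sk->sk_rxhash);
	}
}
```

An unconnected UDP socket flooded by many peers never dereferences the cold field; an established flow records its hash as before.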
 




Re: [PATCH v2] tun: Use netif_receive_skb instead of netif_rx

2016-12-06 Thread Jason Wang



On December 2, 2016 at 03:43, David Miller wrote:

From: Andrey Konovalov 
Date: Thu,  1 Dec 2016 10:34:40 +0100


This patch changes tun.c to call netif_receive_skb instead of netif_rx
when a packet is received (if CONFIG_4KSTACKS is not enabled, to avoid
stack exhaustion). The difference between the two is that netif_rx queues
the packet into the backlog, and netif_receive_skb processes the packet
in the current context.

This patch is required for syzkaller [1] to collect coverage from packet
receive paths when a packet is being received through tun (syzkaller collects
coverage per process in the process context).

As mentioned by Eric this change also speeds up tun/tap. As measured by
Peter it speeds up his closed-loop single-stream tap/OVS benchmark by
about 23%, from 700k packets/second to 867k packets/second.

A similar patch was introduced back in 2010 [2, 3], but the author found
out that the patch doesn't help with the task he had in mind (for cgroups
to shape network traffic based on the original process) and decided not to
go further with it. The main concern back then was about possible stack
exhaustion with 4K stacks.

[1] https://github.com/google/syzkaller

[2] https://www.spinics.net/lists/netdev/thrd440.html#130570

[3] https://www.spinics.net/lists/netdev/msg130570.html

Signed-off-by: Andrey Konovalov 
---

Changes since v1:
- incorporate Eric's note about speed improvements in commit description
- use netif_receive_skb if CONFIG_4KSTACKS is not enabled

Applied to net-next, thanks!


David, looks like this commit is not in net-next.git.

Please help to check.

Thanks


Re: [PATCH v2] tun: Use netif_receive_skb instead of netif_rx

2016-12-06 Thread David Miller
From: Jason Wang 
Date: Wed, 7 Dec 2016 11:21:11 +0800

> David, looks like this commit is not in net-next.git.
> 
> Please help to check.

Take a look, it should be there now.


Re: [PATCH 1/1] ixgbe: write flush vfta registers

2016-12-06 Thread zhuyj
After several weeks of testing, your suggested change still makes this bug
appear, but my patch makes it disappear.

Zhu Yanjun

On Thu, Nov 17, 2016 at 5:33 PM, zhuyj  wrote:
> Sure. From the following.
> "
> VLAN Filter. Each bit ‘i’ in register ‘n’ affects packets with VLAN
> tags equal to 32*n+i.
> 128 VLAN Filter registers compose a table of 4096 bits that cover all
> possible VLAN
> tags.
> Each bit when set, enables packets with the associated VLAN tags to
> pass. Each bit
> when cleared, blocks packets with this VLAN tag.
> "
> Your suggestion seems reasonable. Please wait; I will run tests to
> verify it.
>
> I will keep you update.
>
> On Wed, Nov 16, 2016 at 10:05 PM, Lino Sanfilippo wrote:
>>
>>
>> Hi,
>>
>>>
>>> Sometimes vfta registers cannot be written successfully in dcb mode.
>>> This is very occasional. When the ixgbe nic runs for a very long time,
>>> sometimes this bug occurs. But after IXGBE_WRITE_FLUSH is executed,
>>> this bug never occurs.
>>>
>>> Signed-off-by: Zhu Yanjun 
>>> ---
>>>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 5 -
>>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
>>> index bd93d82..1221cfb 100644
>>> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
>>> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
>>> @@ -4138,8 +4138,10 @@ static void ixgbe_vlan_promisc_enable(struct ixgbe_adapter *adapter)
>>>   }
>>>
>>>   /* Set all bits in the VLAN filter table array */
>>> - for (i = hw->mac.vft_size; i--;)
>>> + for (i = hw->mac.vft_size; i--;) {
>>>   IXGBE_WRITE_REG(hw, IXGBE_VFTA(i), ~0U);
>>> + IXGBE_WRITE_FLUSH(hw);
>>> + }
>>
>> Should it not be sufficient to do the flush only once, at the end of the function?
>>
>> Regards,
>> Lino
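The datasheet excerpt quoted above (bit 'i' of VFTA register 'n' matches packets with VLAN tag 32*n + i) reduces to simple index arithmetic. A minimal illustrative sketch, independent of the driver code:

```c
#include <assert.h>

/* 128 x 32-bit VLAN Filter Table Array registers cover VLAN IDs 0..4095. */
#define VFTA_SIZE 128

/* Register index and bit position for a given VLAN ID. */
static unsigned int vfta_reg(unsigned int vid) { return vid / 32; }
static unsigned int vfta_bit(unsigned int vid) { return vid % 32; }

/* Set the filter bit that allows packets tagged with 'vid' to pass. */
static void vfta_allow(unsigned int vfta[VFTA_SIZE], unsigned int vid)
{
	vfta[vfta_reg(vid)] |= 1u << vfta_bit(vid);
}
```

For example, VLAN 100 lands in register 3, bit 4 (100 = 32*3 + 4); the promiscuous path in the patch simply writes ~0U to all 128 registers.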


Re: [PATCH net-next V2 6/7] liquidio CN23XX: VF TX buffers

2016-12-06 Thread David Miller
From: Raghu Vatsavayi 
Date: Tue, 6 Dec 2016 13:06:06 -0800

> diff --git a/drivers/net/ethernet/cavium/liquidio/lio_vf_main.c b/drivers/net/ethernet/cavium/liquidio/lio_vf_main.c
> index cf80722..ce5cdcd 100644
> --- a/drivers/net/ethernet/cavium/liquidio/lio_vf_main.c
> +++ b/drivers/net/ethernet/cavium/liquidio/lio_vf_main.c
> @@ -270,6 +270,19 @@ static void start_txq(struct net_device *netdev)
>  }
>  
>  /**
> + * \brief Wake a queue
> + * @param netdev network device
> + * @param q which queue to wake
> + */
> +static inline void wake_q(struct net_device *netdev, int q)
 ...
> +static inline int skb_iq(struct lio *lio, struct sk_buff *skb)
 ...
> +static inline int check_txq_state(struct lio *lio, struct sk_buff *skb)

Please do not mark functions inline in foo.c files, let the compiler
decide.


Re: [PATCH net-next 0/2] Add ethtool set regs support

2016-12-06 Thread David Miller
From: Andrew Lunn 
Date: Wed, 7 Dec 2016 03:41:43 +0100

> On Wed, Dec 07, 2016 at 12:33:08AM +0200, Saeed Mahameed wrote:
>> Hi Dave,
>> 
>> This series adds the support for setting device registers from user
>> space ethtool.
> 
> Is this not the start of allowing binary only drivers in user space?
> 
> Do we want this?

I don't think we do.

> 
>> The mlx5 driver has an allowed-register access list and will check the
>> validity of the user request before forwarding it to HW registers. mlx5
>> will allow only mlx5-specific configurations to be set (e.g. device diag
>> counters for HW performance debugging and analysis), which have no
>> standard API to access them.
>> which has no standard API to access it.
> 
> Would it not be better to define a flexible API to do this? We have
> lots of HW performance counters for CPUs. Why is it not possible to do
> this for a network device?

So if this isn't for raw PIO register access, then we should define
an appropriate interface for it.

The ethtool regs stuff is untyped, and arbitrary.

Please create something properly structured, and typed, which would
allow accessing the information you want the user to be able to
access.

That way the kernel can tell what the user is reading or writing,
and thus properly control access.


Re: [PATCH] [v3] net: phy: phy drivers should not set SUPPORTED_[Asym_]Pause

2016-12-06 Thread Timur Tabi

Florian Fainelli wrote:

which is why this made me think the & (SUPPORTED_Pause |
SUPPORTED_Asym_Pause) here is most likely redundant?


Well, like I said, better safe than sorry.  I'd rather keep the &= 
unless you have a strong objection.


--
Sent by an employee of the Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the
Code Aurora Forum, hosted by The Linux Foundation.


Re: [PATCH v3 0/7] irda: w83977af_ir: Neatening

2016-12-06 Thread David Miller
From: Joe Perches 
Date: Tue,  6 Dec 2016 10:15:59 -0800

> Originally on top of Arnd's overly long udelay patches because I
> noticed a misindented block.  That's now already fixed along with some
> other whitespace problems.  These patches are the remainder style
> issues from my original series.
> 
> Even though I haven't turned on the netwinder in a box in the
> garage in who knows how long, if this device is still used somewhere,
> might as well neaten the code too.

Series applied.


Re: [PATCH net-next 0/2] Add ethtool set regs support

2016-12-06 Thread Andrew Lunn
On Wed, Dec 07, 2016 at 12:33:08AM +0200, Saeed Mahameed wrote:
> Hi Dave,
> 
> This series adds the support for setting device registers from user
> space ethtool.

Is this not the start of allowing binary only drivers in user space?

Do we want this?

> The mlx5 driver has an allowed-register access list and will check the
> validity of the user request before forwarding it to HW registers. mlx5
> will allow only mlx5-specific configurations to be set (e.g. device diag
> counters for HW performance debugging and analysis), which have no
> standard API to access them.

Would it not be better to define a flexible API to do this? We have
lots of HW performance counters for CPUs. Why is it not possible to do
this for a network device?

  Andrew


Re: [PATCH] [v3] net: phy: phy drivers should not set SUPPORTED_[Asym_]Pause

2016-12-06 Thread Timur Tabi

Florian Fainelli wrote:

+   if (phydrv->features & (SUPPORTED_Pause | SUPPORTED_Asym_Pause)) {
>+   phydev->supported &= ~(SUPPORTED_Pause | SUPPORTED_Asym_Pause);
>+   phydev->supported |= phydrv->features &
>+(SUPPORTED_Pause | SUPPORTED_Asym_Pause);

Is not the & (SUPPORTED_Pause | SUPPORTED_Asym_Pause) redundant here anyway?


I'm just trying to be safe.  Can I be certain that those bits are 
already zero?





>+   } else {
>+   phydev->supported |= SUPPORTED_Pause | SUPPORTED_Asym_Pause;

that part looks good.


>+   }
>+
>+   phydev->supported |= SUPPORTED_Pause | SUPPORTED_Asym_Pause;

but this one basically "undoes" what the if () clause did where we
checked if either, or one of the two bits was already set?


Ugh, sorry.  I thought I deleted that before sending the patch out. 
I'll send out a v4 tomorrow.


--
Sent by an employee of the Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the
Code Aurora Forum, hosted by The Linux Foundation.


Re: [PATCH] [v3] net: phy: phy drivers should not set SUPPORTED_[Asym_]Pause

2016-12-06 Thread Florian Fainelli
On 12/06/2016 05:50 PM, Timur Tabi wrote:
> Florian Fainelli wrote:
>>> +if (phydrv->features & (SUPPORTED_Pause | SUPPORTED_Asym_Pause)) {
>>> >+phydev->supported &= ~(SUPPORTED_Pause |
>>> SUPPORTED_Asym_Pause);
>>> >+phydev->supported |= phydrv->features &
>>> >+ (SUPPORTED_Pause | SUPPORTED_Asym_Pause);
>> Is not the & (SUPPORTED_Pause | SUPPORTED_Asym_Pause) redundant here
>> anyway?
> 
> I'm just trying to be safe.  Can I be certain that those bits are
> already zero?

The bits are most likely not zero, since we have all the
PHY_GBIT_FEATURES bits defined in there as well, but I don't think that
is a real problem though, because we did this before:

/* Start out supporting everything. Eventually,
 * a controller will attach, and may modify one
 * or both of these values
 */
phydev->supported = phydrv->features;
of_set_phy_supported(phydev);
phydev->advertising = phydev->supported;

which is why this made me think the & (SUPPORTED_Pause |
SUPPORTED_Asym_Pause) here is most likely redundant?

Thanks!
-- 
Florian


[PATCH net-next v3 1/1] driver: ipvlan: Free ipvl_port directly with kfree instead of kfree_rcu

2016-12-06 Thread fgao
From: Gao Feng 

There are two functions that free the ipvl_port now. The first
is ipvlan_port_create. It frees the ipvl_port in the error handler,
so it can kfree it directly. The second is ipvlan_port_destroy. It
invokes netdev_rx_handler_unregister, which first enforces a grace
period via synchronize_net, so it can also kfree the ipvl_port
directly and safely.

So it is unnecessary to use kfree_rcu to free the ipvl_port.

Signed-off-by: Gao Feng 
---
 v3: Add more detail comments
 v2: Remove the rcu of ipvl_port directly
 v1: Initial patch

 drivers/net/ipvlan/ipvlan.h  | 1 -
 drivers/net/ipvlan/ipvlan_main.c | 4 ++--
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ipvlan/ipvlan.h b/drivers/net/ipvlan/ipvlan.h
index 05a62d2..031093e 100644
--- a/drivers/net/ipvlan/ipvlan.h
+++ b/drivers/net/ipvlan/ipvlan.h
@@ -97,7 +97,6 @@ struct ipvl_port {
struct work_struct  wq;
struct sk_buff_head backlog;
int count;
-   struct rcu_head rcu;
 };
 
 static inline struct ipvl_port *ipvlan_port_get_rcu(const struct net_device *d)
diff --git a/drivers/net/ipvlan/ipvlan_main.c b/drivers/net/ipvlan/ipvlan_main.c
index c6aa667..44ceebc 100644
--- a/drivers/net/ipvlan/ipvlan_main.c
+++ b/drivers/net/ipvlan/ipvlan_main.c
@@ -128,7 +128,7 @@ static int ipvlan_port_create(struct net_device *dev)
return 0;
 
 err:
-   kfree_rcu(port, rcu);
+   kfree(port);
return err;
 }
 
@@ -145,7 +145,7 @@ static void ipvlan_port_destroy(struct net_device *dev)
netdev_rx_handler_unregister(dev);
cancel_work_sync(&port->wq);
__skb_queue_purge(&port->backlog);
-   kfree_rcu(port, rcu);
+   kfree(port);
 }
 
 #define IPVLAN_FEATURES \
-- 
1.9.1




Re: [PATCH] [v3] net: phy: phy drivers should not set SUPPORTED_[Asym_]Pause

2016-12-06 Thread Florian Fainelli
On 12/06/2016 04:27 PM, Timur Tabi wrote:
> Instead of having individual PHY drivers set the SUPPORTED_Pause and
> SUPPORTED_Asym_Pause flags, phylib itself should set those flags,
> unless there is a hardware erratum or other special case.  During
> autonegotiation, the PHYs will determine whether to enable pause
> frame support.
> 
> Pause frames are a feature that is supported by the MAC.  It is the MAC
> that generates the frames and that processes them.  The PHY can only be
> configured to allow them to pass through.
> 
> So the new process is:
> 
> 1) Unless the PHY driver overrides it, phylib sets the SUPPORTED_Pause
> and SUPPORTED_AsymPause bits in phydev->supported.  This indicates that
> the PHY supports pause frames.
> 
> 2) The MAC driver checks phydev->supported before it calls phy_start().
> If (SUPPORTED_Pause | SUPPORTED_AsymPause) is set, then the MAC driver
> sets those bits in phydev->advertising, if it wants to enable pause
> frame support.
> 
> 3) When the link state changes, the MAC driver checks phydev->pause and
> phydev->asym_pause. If the bits are set, then it enables the corresponding
> features in the MAC.  The algorithm is:
> 
>   if (phydev->pause)
>   The MAC should be programmed to receive and honor
> pause frames it receives, i.e. enable receive flow control.
> 
>   if (phydev->pause != phydev->asym_pause)
>   The MAC should be programmed to transmit pause
>   frames when needed, i.e. enable transmit flow control.
> 
> Signed-off-by: Timur Tabi 
> ---

> diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
> index 49a1c98..fe36eeb 100644
> --- a/drivers/net/phy/phy_device.c
> +++ b/drivers/net/phy/phy_device.c
> @@ -1598,6 +1598,27 @@ static int phy_probe(struct device *dev)
>   of_set_phy_supported(phydev);
>   phydev->advertising = phydev->supported;
>  
> + /* The Pause Frame bits indicate that the PHY can support passing
> +  * pause frames. During autonegotiation, the PHYs will determine if
> +  * they should allow pause frames to pass.  The MAC driver should then
> +  * use that result to determine whether to enable flow control via
> +  * pause frames.
> +  *
> +  * Normally, PHY drivers should not set the Pause bits, and instead
> +  * allow phylib to do that.  However, there may be some situations
> +  * (e.g. hardware erratum) where the driver wants to set only one
> +  * of these bits.
> +  */
> + if (phydrv->features & (SUPPORTED_Pause | SUPPORTED_Asym_Pause)) {
> + phydev->supported &= ~(SUPPORTED_Pause | SUPPORTED_Asym_Pause);
> + phydev->supported |= phydrv->features &
> +  (SUPPORTED_Pause | SUPPORTED_Asym_Pause);

Is not the & (SUPPORTED_Pause | SUPPORTED_Asym_Pause) redundant here anyway?

> + } else {
> + phydev->supported |= SUPPORTED_Pause | SUPPORTED_Asym_Pause;

that part looks good.

> + }
> +
> + phydev->supported |= SUPPORTED_Pause | SUPPORTED_Asym_Pause;

but this one basically "undoes" what the if () clause did where we
checked if either, or one of the two bits was already set?
-- 
Florian


Re: [PATCH net-next v2 1/1] driver: ipvlan: Free ipvl_port directly with kfree instead of kfree_rcu

2016-12-06 Thread Gao Feng
Hi Eric,

On Tue, Dec 6, 2016 at 11:18 PM, Eric Dumazet  wrote:
> On Tue, 2016-12-06 at 21:54 +0800, f...@ikuai8.com wrote:
>> From: Gao Feng 
>>
>> There is nothing that may still reference the ipvlan port when it is
>> freed in ipvlan_port_create and ipvlan_port_destroy. So it is
>> unnecessary to use kfree_rcu.
>
> You did not really explain _why_ it was safe/unnecessary.
> Why should anyone trust you ?

Thanks for your point.
I found the reason yesterday after receiving your suggestion, and replied
to the v1 email.
Then I sent the v2 patch.
I assumed the reviewers would know more than me, so I didn't add more details.

I will add more details in the v3 patch.

>
> The reason an RCU grace period is not needed is that
> netdev_rx_handler_unregister() already enforces a grace period.
>
> My guess is ipvlan copied code in macvlan.
>
> At the time macvlan was written, commit
> 00cfec37484761a44 ("net: add a synchronize_net() in
> netdev_rx_handler_unregister()") was not there yet.
>
> macvlan could be changed the same way.

Yes. I found that netdev_rx_handler_unregister enforces a grace period,
and I am preparing to check the other code as well.

Best Regards
Feng

>
>
>




[PATCH] [v3] net: phy: phy drivers should not set SUPPORTED_[Asym_]Pause

2016-12-06 Thread Timur Tabi
Instead of having individual PHY drivers set the SUPPORTED_Pause and
SUPPORTED_Asym_Pause flags, phylib itself should set those flags,
unless there is a hardware erratum or other special case.  During
autonegotiation, the PHYs will determine whether to enable pause
frame support.

Pause frames are a feature that is supported by the MAC.  It is the MAC
that generates the frames and that processes them.  The PHY can only be
configured to allow them to pass through.

So the new process is:

1) Unless the PHY driver overrides it, phylib sets the SUPPORTED_Pause
and SUPPORTED_AsymPause bits in phydev->supported.  This indicates that
the PHY supports pause frames.

2) The MAC driver checks phydev->supported before it calls phy_start().
If (SUPPORTED_Pause | SUPPORTED_AsymPause) is set, then the MAC driver
sets those bits in phydev->advertising, if it wants to enable pause
frame support.

3) When the link state changes, the MAC driver checks phydev->pause and
phydev->asym_pause. If the bits are set, then it enables the corresponding
features in the MAC.  The algorithm is:

if (phydev->pause)
The MAC should be programmed to receive and honor
pause frames it receives, i.e. enable receive flow control.

if (phydev->pause != phydev->asym_pause)
The MAC should be programmed to transmit pause
frames when needed, i.e. enable transmit flow control.
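The three-step algorithm above reduces to a few lines of logic. Below is a hedged user-space sketch of what a MAC driver's link-change handler might compute; the structs are illustrative stand-ins for the relevant phy_device fields, not kernel definitions:

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the phy_device pause fields resolved by autonegotiation. */
struct phy_result {
	bool pause;      /* symmetric pause negotiated   */
	bool asym_pause; /* asymmetric pause negotiated  */
};

struct flow_ctrl {
	bool rx; /* receive and honor pause frames */
	bool tx; /* transmit pause frames          */
};

/* Step 3 of the commit message, expressed directly as logic:
 * rx flow control follows phydev->pause; tx flow control is enabled
 * when pause and asym_pause differ.
 */
static struct flow_ctrl resolve_flow_ctrl(const struct phy_result *phy)
{
	struct flow_ctrl fc;

	fc.rx = phy->pause;
	fc.tx = phy->pause != phy->asym_pause;
	return fc;
}
```

So pause=1/asym=0 enables both directions, pause=1/asym=1 enables receive only, and pause=0/asym=1 enables transmit only.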

Signed-off-by: Timur Tabi 
---
 drivers/net/phy/bcm-cygnus.c |  3 +--
 drivers/net/phy/bcm7xxx.c|  6 ++
 drivers/net/phy/broadcom.c   | 42 ++
 drivers/net/phy/icplus.c |  6 ++
 drivers/net/phy/intel-xway.c | 24 
 drivers/net/phy/micrel.c | 30 --
 drivers/net/phy/microchip.c  |  3 +--
 drivers/net/phy/national.c   |  2 +-
 drivers/net/phy/phy_device.c | 21 +
 drivers/net/phy/smsc.c   | 18 ++
 10 files changed, 68 insertions(+), 87 deletions(-)

diff --git a/drivers/net/phy/bcm-cygnus.c b/drivers/net/phy/bcm-cygnus.c
index 49bbc68..12c6d49 100644
--- a/drivers/net/phy/bcm-cygnus.c
+++ b/drivers/net/phy/bcm-cygnus.c
@@ -134,8 +134,7 @@ static int bcm_cygnus_resume(struct phy_device *phydev)
.phy_id= PHY_ID_BCM_CYGNUS,
.phy_id_mask   = 0xfff0,
.name  = "Broadcom Cygnus PHY",
-   .features  = PHY_GBIT_FEATURES |
-   SUPPORTED_Pause | SUPPORTED_Asym_Pause,
+   .features  = PHY_GBIT_FEATURES,
.config_init   = bcm_cygnus_config_init,
.config_aneg   = genphy_config_aneg,
.read_status   = genphy_read_status,
diff --git a/drivers/net/phy/bcm7xxx.c b/drivers/net/phy/bcm7xxx.c
index 9636da0..c1629df 100644
--- a/drivers/net/phy/bcm7xxx.c
+++ b/drivers/net/phy/bcm7xxx.c
@@ -308,8 +308,7 @@ static int bcm7xxx_suspend(struct phy_device *phydev)
.phy_id = (_oui),   \
.phy_id_mask= 0xfff0,   \
.name   = _name,\
-   .features   = PHY_GBIT_FEATURES |   \
- SUPPORTED_Pause | SUPPORTED_Asym_Pause,   \
+   .features   = PHY_GBIT_FEATURES,\
.flags  = PHY_IS_INTERNAL,  \
.config_init= bcm7xxx_28nm_config_init, \
.config_aneg= genphy_config_aneg,   \
@@ -322,8 +321,7 @@ static int bcm7xxx_suspend(struct phy_device *phydev)
.phy_id = (_oui),   \
.phy_id_mask= 0xfff0,   \
.name   = _name,\
-   .features   = PHY_BASIC_FEATURES |  \
- SUPPORTED_Pause | SUPPORTED_Asym_Pause,   \
+   .features   = PHY_BASIC_FEATURES,   \
.flags  = PHY_IS_INTERNAL,  \
.config_init= bcm7xxx_config_init,  \
.config_aneg= genphy_config_aneg,   \
diff --git a/drivers/net/phy/broadcom.c b/drivers/net/phy/broadcom.c
index b1e32e9..1c990c8 100644
--- a/drivers/net/phy/broadcom.c
+++ b/drivers/net/phy/broadcom.c
@@ -540,8 +540,7 @@ static int brcm_fet_config_intr(struct phy_device *phydev)
.phy_id = PHY_ID_BCM5411,
.phy_id_mask= 0xfff0,
.name   = "Broadcom BCM5411",
-   .features   = PHY_GBIT_FEATURES |
- SUPPORTED_Pause | SUPPORTED_Asym_Pause,
+   .features   = PHY_GBIT_FEATURES,
.flags  = PHY_HAS_MAGICANEG | PHY_HAS_INTERRUPT,
.config_init= 

[PATCH net-next] bpf: fix loading of BPF_MAXINSNS sized programs

2016-12-06 Thread Daniel Borkmann
The general assumption is that a single program can hold up to BPF_MAXINSNS,
that is, 4096 instructions. This is the case with cBPF, and that limit was
carried over to eBPF. When recently testing the digest, I noticed that it's
actually not possible to feed 4096 instructions via bpf(2).

The check for > BPF_MAXINSNS was added back then to bpf_check() in
cbd357008604 ("bpf: verifier (add ability to receive verification log)").
However, 09756af46893 ("bpf: expand BPF syscall with program load/unload")
added yet another check that comes before that into bpf_prog_load(),
but this time bails out already in case of >= BPF_MAXINSNS.

Fix it up and perform the check early in bpf_prog_load(), so we can drop
the second one in bpf_check(). This makes sense, because a 0-insn program
is useless as well, and we don't want to waste any resources doing work up
to the bpf_check() point. The existing bpf(2) man page documents E2BIG as
the official error for such cases, so just stick with it here as well.

Fixes: 09756af46893 ("bpf: expand BPF syscall with program load/unload")
Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
---
 ( net-next is just fine imho. )

 kernel/bpf/syscall.c  | 4 ++--
 kernel/bpf/verifier.c | 3 ---
 2 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index c0d2b42..88f609f 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -786,8 +786,8 @@ static int bpf_prog_load(union bpf_attr *attr)
/* eBPF programs must be GPL compatible to use GPL-ed functions */
is_gpl = license_is_gpl_compatible(license);
 
-   if (attr->insn_cnt >= BPF_MAXINSNS)
-   return -EINVAL;
+   if (attr->insn_cnt == 0 || attr->insn_cnt > BPF_MAXINSNS)
+   return -E2BIG;
 
if (type == BPF_PROG_TYPE_KPROBE &&
attr->kern_version != LINUX_VERSION_CODE)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index cb37339..da9fb2a 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -3133,9 +3133,6 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr)
struct bpf_verifier_env *env;
int ret = -EINVAL;
 
-   if ((*prog)->len <= 0 || (*prog)->len > BPF_MAXINSNS)
-   return -E2BIG;
-
/* 'struct bpf_verifier_env' can be global, but since it's not small,
 * allocate/free it every time bpf_check() is called
 */
-- 
1.9.3



Re: [PATCH]net:sched:release lock before tcf_dump_walker() normal return to avoid deadlock

2016-12-06 Thread Cong Wang
On Tue, Dec 6, 2016 at 5:50 AM, Jamal Hadi Salim  wrote:
> On 16-12-06 12:36 AM, Feng Deng wrote:
>>
>> From: Feng Deng
>>
>> release lock before tcf_dump_walker() normal return to avoid deadlock
>>
>
> /Scratching my head.
>
> I am probably missing something obvious.
What are the conditions under which this deadlock will happen?
> Do you have a testcase we can try?

I don't even see a patch (tried Google too).


[PATCH v5 06/13] net: ethernet: ti: cpts: disable cpts when unregistered

2016-12-06 Thread Grygorii Strashko
The CPTS is currently left enabled after unregistration.
Hence, disable it in cpts_unregister().

Signed-off-by: Grygorii Strashko 
Acked-by: Richard Cochran 
---
 drivers/net/ethernet/ti/cpts.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/ethernet/ti/cpts.c b/drivers/net/ethernet/ti/cpts.c
index 3dda6d5..d3c1ac5 100644
--- a/drivers/net/ethernet/ti/cpts.c
+++ b/drivers/net/ethernet/ti/cpts.c
@@ -404,6 +404,10 @@ void cpts_unregister(struct cpts *cpts)
ptp_clock_unregister(cpts->clock);
cancel_delayed_work_sync(&cpts->overflow_work);
}
+
+   cpts_write32(cpts, 0, int_enable);
+   cpts_write32(cpts, 0, control);
+
if (cpts->refclk)
cpts_clk_release(cpts);
 }
-- 
2.10.1



[PATCH v5 03/13] net: ethernet: ti: cpsw: minimize direct access to struct cpts

2016-12-06 Thread Grygorii Strashko
This will provide more flexibility in changing CPTS internals and is also
required for further changes.

Signed-off-by: Grygorii Strashko 
---
 drivers/net/ethernet/ti/cpsw.c | 28 +++-
 drivers/net/ethernet/ti/cpts.h | 39 +++
 2 files changed, 54 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index 8fdb274..7599895 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -1562,7 +1562,7 @@ static netdev_tx_t cpsw_ndo_start_xmit(struct sk_buff *skb,
}
 
if (skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP &&
-   cpsw->cpts->tx_enable)
+   cpts_is_tx_enabled(cpsw->cpts))
skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;
 
skb_tx_timestamp(skb);
@@ -1601,7 +1601,8 @@ static void cpsw_hwtstamp_v1(struct cpsw_common *cpsw)
struct cpsw_slave *slave = >slaves[cpsw->data.active_slave];
u32 ts_en, seq_id;
 
-   if (!cpsw->cpts->tx_enable && !cpsw->cpts->rx_enable) {
+   if (!cpts_is_tx_enabled(cpsw->cpts) &&
+   !cpts_is_rx_enabled(cpsw->cpts)) {
slave_write(slave, 0, CPSW1_TS_CTL);
return;
}
@@ -1609,10 +1610,10 @@ static void cpsw_hwtstamp_v1(struct cpsw_common *cpsw)
seq_id = (30 << CPSW_V1_SEQ_ID_OFS_SHIFT) | ETH_P_1588;
ts_en = EVENT_MSG_BITS << CPSW_V1_MSG_TYPE_OFS;
 
-   if (cpsw->cpts->tx_enable)
+   if (cpts_is_tx_enabled(cpsw->cpts))
ts_en |= CPSW_V1_TS_TX_EN;
 
-   if (cpsw->cpts->rx_enable)
+   if (cpts_is_rx_enabled(cpsw->cpts))
ts_en |= CPSW_V1_TS_RX_EN;
 
slave_write(slave, ts_en, CPSW1_TS_CTL);
@@ -1635,20 +1636,20 @@ static void cpsw_hwtstamp_v2(struct cpsw_priv *priv)
case CPSW_VERSION_2:
ctrl &= ~CTRL_V2_ALL_TS_MASK;
 
-   if (cpsw->cpts->tx_enable)
+   if (cpts_is_tx_enabled(cpsw->cpts))
ctrl |= CTRL_V2_TX_TS_BITS;
 
-   if (cpsw->cpts->rx_enable)
+   if (cpts_is_rx_enabled(cpsw->cpts))
ctrl |= CTRL_V2_RX_TS_BITS;
break;
case CPSW_VERSION_3:
default:
ctrl &= ~CTRL_V3_ALL_TS_MASK;
 
-   if (cpsw->cpts->tx_enable)
+   if (cpts_is_tx_enabled(cpsw->cpts))
ctrl |= CTRL_V3_TX_TS_BITS;
 
-   if (cpsw->cpts->rx_enable)
+   if (cpts_is_rx_enabled(cpsw->cpts))
ctrl |= CTRL_V3_RX_TS_BITS;
break;
}
@@ -1684,7 +1685,7 @@ static int cpsw_hwtstamp_set(struct net_device *dev, struct ifreq *ifr)
 
switch (cfg.rx_filter) {
case HWTSTAMP_FILTER_NONE:
-   cpts->rx_enable = 0;
+   cpts_rx_enable(cpts, 0);
break;
case HWTSTAMP_FILTER_ALL:
case HWTSTAMP_FILTER_PTP_V1_L4_EVENT:
@@ -1700,14 +1701,14 @@ static int cpsw_hwtstamp_set(struct net_device *dev, struct ifreq *ifr)
case HWTSTAMP_FILTER_PTP_V2_EVENT:
case HWTSTAMP_FILTER_PTP_V2_SYNC:
case HWTSTAMP_FILTER_PTP_V2_DELAY_REQ:
-   cpts->rx_enable = 1;
+   cpts_rx_enable(cpts, 1);
cfg.rx_filter = HWTSTAMP_FILTER_PTP_V2_EVENT;
break;
default:
return -ERANGE;
}
 
-   cpts->tx_enable = cfg.tx_type == HWTSTAMP_TX_ON;
+   cpts_tx_enable(cpts, cfg.tx_type == HWTSTAMP_TX_ON);
 
switch (cpsw->version) {
case CPSW_VERSION_1:
@@ -1736,8 +1737,9 @@ static int cpsw_hwtstamp_get(struct net_device *dev, struct ifreq *ifr)
return -EOPNOTSUPP;
 
cfg.flags = 0;
-   cfg.tx_type = cpts->tx_enable ? HWTSTAMP_TX_ON : HWTSTAMP_TX_OFF;
-   cfg.rx_filter = (cpts->rx_enable ?
+   cfg.tx_type = cpts_is_tx_enabled(cpts) ?
+ HWTSTAMP_TX_ON : HWTSTAMP_TX_OFF;
+   cfg.rx_filter = (cpts_is_rx_enabled(cpts) ?
 HWTSTAMP_FILTER_PTP_V2_EVENT : HWTSTAMP_FILTER_NONE);
 
return copy_to_user(ifr->ifr_data, &cfg, sizeof(cfg)) ? -EFAULT : 0;
diff --git a/drivers/net/ethernet/ti/cpts.h b/drivers/net/ethernet/ti/cpts.h
index 416ba2c..29a1e80c 100644
--- a/drivers/net/ethernet/ti/cpts.h
+++ b/drivers/net/ethernet/ti/cpts.h
@@ -132,6 +132,27 @@ void cpts_rx_timestamp(struct cpts *cpts, struct sk_buff *skb);
 void cpts_tx_timestamp(struct cpts *cpts, struct sk_buff *skb);
 int cpts_register(struct device *dev, struct cpts *cpts, u32 mult, u32 shift);
 void cpts_unregister(struct cpts *cpts);
+
+static inline void cpts_rx_enable(struct cpts *cpts, int enable)
+{
+   cpts->rx_enable = enable;
+}
+
+static inline bool cpts_is_rx_enabled(struct cpts *cpts)
+{
+   return !!cpts->rx_enable;
+}
+
+static inline void cpts_tx_enable(struct cpts *cpts, int 

[PATCH v5 00/13] net: ethernet: ti: cpts: update and fixes

2016-12-06 Thread Grygorii Strashko
This is a preparation series intended to clean up and optimize the TI CPTS
driver to facilitate further integration with other TI SoCs like Keystone 2.

Changes in v5:
- fixed copy paste error in cpts_release
- reworked cc.mult/shift and cc_mult initialization 

Changes in v4:
- fixed build error in patch
  "net: ethernet: ti: cpts: clean up event list if event pool is empty"
- rebased on top of net-next
 
Changes in v3:
- patches reordered: fixes and small updates moved first
- added comments in code about cpts->cc_mult
- conversion range (maxsec) limited to 10 sec

Changes in v2:
- patch "net: ethernet: ti: cpts: rework initialization/deinitialization"
  was split on 4 patches
- applied comments from Richard Cochran
- dropped patch
  "net: ethernet: ti: cpts: add return value to tx and rx timestamp funcitons"
- new patches added:
  "net: ethernet: ti: cpts: drop excessive writes to CTRL and INT_EN regs"
  and "clocksource: export the clocks_calc_mult_shift to use by timestamp code"

Links on prev versions:
v4: https://lkml.org/lkml/2016/12/6/496
v3: https://www.spinics.net/lists/devicetree/msg153474.html
v2: http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1282034.html
v1: http://www.spinics.net/lists/linux-omap/msg131925.html

Grygorii Strashko (11):
  net: ethernet: ti: cpts: switch to readl/writel_relaxed()
  net: ethernet: ti: allow cpts to be built separately
  net: ethernet: ti: cpsw: minimize direct access to struct cpts
  net: ethernet: ti: cpts: fix unbalanced clk api usage in cpts_register/unregister
  net: ethernet: ti: cpts: fix registration order
  net: ethernet: ti: cpts: disable cpts when unregistered
  net: ethernet: ti: cpts: drop excessive writes to CTRL and INT_EN regs
  net: ethernet: ti: cpts: rework initialization/deinitialization
  net: ethernet: ti: cpts: move dt props parsing to cpts driver
  net: ethernet: ti: cpts: calc mult and shift from refclk freq
  net: ethernet: ti: cpts: fix overflow check period

Murali Karicheri (1):
  clocksource: export the clocks_calc_mult_shift to use by timestamp code

WingMan Kwok (1):
  net: ethernet: ti: cpts: clean up event list if event pool is empty

 Documentation/devicetree/bindings/net/cpsw.txt |   8 +-
 drivers/net/ethernet/ti/Kconfig                |   2 +-
 drivers/net/ethernet/ti/Makefile               |   3 +-
 drivers/net/ethernet/ti/cpsw.c                 |  84 -
 drivers/net/ethernet/ti/cpsw.h                 |   2 -
 drivers/net/ethernet/ti/cpts.c                 | 233 ++---
 drivers/net/ethernet/ti/cpts.h                 |  80 -
 kernel/time/clocksource.c                      |   1 +
 8 files changed, 297 insertions(+), 116 deletions(-)

-- 
2.10.1



[PATCH v5 02/13] net: ethernet: ti: allow cpts to be built separately

2016-12-06 Thread Grygorii Strashko
The TI CPTS IP is used as part of the TI OMAP CPSW driver, but it's also
present as part of NETCP on TI Keystone 2 SoCs. So it's required to enable
the build of CPTS for both of these drivers, which can be achieved by
allowing CPTS to be built separately.

Hence, allow cpts to be built separately and convert it to a module, as
both the CPSW and NETCP drivers can be built as modules.

Signed-off-by: Grygorii Strashko 
---
 drivers/net/ethernet/ti/Kconfig  |  2 +-
 drivers/net/ethernet/ti/Makefile |  3 ++-
 drivers/net/ethernet/ti/cpsw.c   | 22 +-
 drivers/net/ethernet/ti/cpts.c   | 16 
 drivers/net/ethernet/ti/cpts.h   | 18 ++
 5 files changed, 42 insertions(+), 19 deletions(-)

diff --git a/drivers/net/ethernet/ti/Kconfig b/drivers/net/ethernet/ti/Kconfig
index 9904d74..ff7f518 100644
--- a/drivers/net/ethernet/ti/Kconfig
+++ b/drivers/net/ethernet/ti/Kconfig
@@ -74,7 +74,7 @@ config TI_CPSW
  will be called cpsw.
 
 config TI_CPTS
-   bool "TI Common Platform Time Sync (CPTS) Support"
+   tristate "TI Common Platform Time Sync (CPTS) Support"
depends on TI_CPSW
select PTP_1588_CLOCK
---help---
diff --git a/drivers/net/ethernet/ti/Makefile b/drivers/net/ethernet/ti/Makefile
index d420d94..1e7c10b 100644
--- a/drivers/net/ethernet/ti/Makefile
+++ b/drivers/net/ethernet/ti/Makefile
@@ -12,8 +12,9 @@ obj-$(CONFIG_TI_DAVINCI_MDIO) += davinci_mdio.o
 obj-$(CONFIG_TI_DAVINCI_CPDMA) += davinci_cpdma.o
 obj-$(CONFIG_TI_CPSW_PHY_SEL) += cpsw-phy-sel.o
 obj-$(CONFIG_TI_CPSW_ALE) += cpsw_ale.o
+obj-$(CONFIG_TI_CPTS) += cpts.o
 obj-$(CONFIG_TI_CPSW) += ti_cpsw.o
-ti_cpsw-y := cpsw.o cpts.o
+ti_cpsw-y := cpsw.o
 
 obj-$(CONFIG_TI_KEYSTONE_NETCP) += keystone_netcp.o
 keystone_netcp-y := netcp_core.o
diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index f373a4b..8fdb274 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -1594,7 +1594,7 @@ static netdev_tx_t cpsw_ndo_start_xmit(struct sk_buff *skb,
return NETDEV_TX_BUSY;
 }
 
-#ifdef CONFIG_TI_CPTS
+#if IS_ENABLED(CONFIG_TI_CPTS)
 
 static void cpsw_hwtstamp_v1(struct cpsw_common *cpsw)
 {
@@ -1742,7 +1742,16 @@ static int cpsw_hwtstamp_get(struct net_device *dev, struct ifreq *ifr)
 
return copy_to_user(ifr->ifr_data, &cfg, sizeof(cfg)) ? -EFAULT : 0;
 }
+#else
+static int cpsw_hwtstamp_get(struct net_device *dev, struct ifreq *ifr)
+{
+   return -EOPNOTSUPP;
+}
 
+static int cpsw_hwtstamp_set(struct net_device *dev, struct ifreq *ifr)
+{
+   return -EOPNOTSUPP;
+}
 #endif /*CONFIG_TI_CPTS*/
 
 static int cpsw_ndo_ioctl(struct net_device *dev, struct ifreq *req, int cmd)
@@ -1755,12 +1764,10 @@ static int cpsw_ndo_ioctl(struct net_device *dev, struct ifreq *req, int cmd)
return -EINVAL;
 
switch (cmd) {
-#ifdef CONFIG_TI_CPTS
case SIOCSHWTSTAMP:
return cpsw_hwtstamp_set(dev, req);
case SIOCGHWTSTAMP:
return cpsw_hwtstamp_get(dev, req);
-#endif
}
 
if (!cpsw->slaves[slave_no].phy)
@@ -2100,10 +2107,10 @@ static void cpsw_set_msglevel(struct net_device *ndev, u32 value)
priv->msg_enable = value;
 }
 
+#if IS_ENABLED(CONFIG_TI_CPTS)
 static int cpsw_get_ts_info(struct net_device *ndev,
struct ethtool_ts_info *info)
 {
-#ifdef CONFIG_TI_CPTS
struct cpsw_common *cpsw = ndev_to_cpsw(ndev);
 
info->so_timestamping =
@@ -2120,7 +2127,12 @@ static int cpsw_get_ts_info(struct net_device *ndev,
info->rx_filters =
(1 << HWTSTAMP_FILTER_NONE) |
(1 << HWTSTAMP_FILTER_PTP_V2_EVENT);
+   return 0;
+}
 #else
+static int cpsw_get_ts_info(struct net_device *ndev,
+   struct ethtool_ts_info *info)
+{
info->so_timestamping =
SOF_TIMESTAMPING_TX_SOFTWARE |
SOF_TIMESTAMPING_RX_SOFTWARE |
@@ -2128,9 +2140,9 @@ static int cpsw_get_ts_info(struct net_device *ndev,
info->phc_index = -1;
info->tx_types = 0;
info->rx_filters = 0;
-#endif
return 0;
 }
+#endif
 
 static int cpsw_get_link_ksettings(struct net_device *ndev,
   struct ethtool_link_ksettings *ecmd)
diff --git a/drivers/net/ethernet/ti/cpts.c b/drivers/net/ethernet/ti/cpts.c
index a42c449..8cb0369 100644
--- a/drivers/net/ethernet/ti/cpts.c
+++ b/drivers/net/ethernet/ti/cpts.c
@@ -31,8 +31,6 @@
 
 #include "cpts.h"
 
-#ifdef CONFIG_TI_CPTS
-
 #define cpts_read32(c, r)  readl_relaxed(&c->reg->r)
 #define cpts_write32(c, v, r)  writel_relaxed(v, &c->reg->r)
 
@@ -334,6 +332,7 @@ void cpts_rx_timestamp(struct cpts *cpts, struct sk_buff *skb)
memset(ssh, 0, sizeof(*ssh));
ssh->hwtstamp = ns_to_ktime(ns);
 }
+EXPORT_SYMBOL_GPL(cpts_rx_timestamp);
 
 void cpts_tx_timestamp(struct cpts *cpts, struct sk_buff *skb)
 {
@@ -349,13 +348,11 @@ 
