[net-next 05/17] i40evf: don't give up

2015-10-16 Thread Jeff Kirsher
From: Mitch Williams 

When the VF driver is unable to communicate with the PF, it just gives
up and never tries again. Aside from the obvious character flaw that
this shows, it's also a lousy user experience.

When PF communications fail, wait five seconds, and try again. And
again. Don't give up, little VF driver! Your prince will come!

Change-ID: Ia1378a39879883563b8faffce819f375821f9585
Signed-off-by: Mitch Williams 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40evf/i40evf_main.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40evf/i40evf_main.c 
b/drivers/net/ethernet/intel/i40evf/i40evf_main.c
index 1dd5245..ce99790 100644
--- a/drivers/net/ethernet/intel/i40evf/i40evf_main.c
+++ b/drivers/net/ethernet/intel/i40evf/i40evf_main.c
@@ -2357,9 +2357,12 @@ err_alloc:
 err:
/* Things went into the weeds, so try again later */
if (++adapter->aq_wait_count > I40EVF_AQ_MAX_ERR) {
-   dev_err(&pdev->dev, "Failed to communicate with PF; giving 
up\n");
+   dev_err(&pdev->dev, "Failed to communicate with PF; waiting 
before retry\n");
adapter->flags |= I40EVF_FLAG_PF_COMMS_FAILED;
-   return; /* do not reschedule */
+   i40evf_shutdown_adminq(hw);
+   adapter->state = __I40EVF_STARTUP;
+   schedule_delayed_work(&adapter->init_task, HZ * 5);
+   return;
}
schedule_delayed_work(&adapter->init_task, HZ);
 }
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[net-next 06/17] i40e/i40evf: refactor IRQ enable function

2015-10-16 Thread Jeff Kirsher
From: Jesse Brandeburg 

This change moves a multi-line register setting into a function
which simplifies reading the flow of the enable function.

This also fixes a bug where the enable function was enabling
the interrupt twice while trying to update the two interrupt
throttle rate thresholds for Rx and Tx.

Change-ID: Ie308f9d0d48540204590cb9d7a5a7b1196f959bb
Signed-off-by: Jesse Brandeburg 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   | 112 +++---
 drivers/net/ethernet/intel/i40evf/i40e_txrx.c | 108 ++---
 2 files changed, 128 insertions(+), 92 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c 
b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 512707c..889d25d 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -815,6 +815,8 @@ void i40e_force_wb(struct i40e_vsi *vsi, struct 
i40e_q_vector *q_vector)
  * i40e_set_new_dynamic_itr - Find new ITR level
  * @rc: structure containing ring performance data
  *
+ * Returns true if ITR changed, false if not
+ *
  * Stores a new ITR value based on packets and byte counts during
  * the last interrupt.  The advantage of per interrupt computation
  * is faster updates and more accurate ITR for the current traffic
@@ -823,14 +825,14 @@ void i40e_force_wb(struct i40e_vsi *vsi, struct 
i40e_q_vector *q_vector)
  * testing data as well as attempting to minimize response time
  * while increasing bulk throughput.
  **/
-static void i40e_set_new_dynamic_itr(struct i40e_ring_container *rc)
+static bool i40e_set_new_dynamic_itr(struct i40e_ring_container *rc)
 {
enum i40e_latency_range new_latency_range = rc->latency_range;
u32 new_itr = rc->itr;
int bytes_per_int;
 
if (rc->total_packets == 0 || !rc->itr)
-   return;
+   return false;
 
/* simple throttlerate management
 *   0-10MB/s   lowest (10 ints/s)
@@ -874,11 +876,15 @@ static void i40e_set_new_dynamic_itr(struct 
i40e_ring_container *rc)
break;
}
 
-   if (new_itr != rc->itr)
-   rc->itr = new_itr;
-
rc->total_bytes = 0;
rc->total_packets = 0;
+
+   if (new_itr != rc->itr) {
+   rc->itr = new_itr;
+   return true;
+   }
+
+   return false;
 }
 
 /**
@@ -1747,6 +1753,21 @@ static int i40e_clean_rx_irq_1buf(struct i40e_ring 
*rx_ring, int budget)
return total_rx_packets;
 }
 
+static u32 i40e_buildreg_itr(const int type, const u16 itr)
+{
+   u32 val;
+
+   val = I40E_PFINT_DYN_CTLN_INTENA_MASK |
+ I40E_PFINT_DYN_CTLN_CLEARPBA_MASK |
+ (type << I40E_PFINT_DYN_CTLN_ITR_INDX_SHIFT) |
+ (itr << I40E_PFINT_DYN_CTLN_INTERVAL_SHIFT);
+
+   return val;
+}
+
+/* a small macro to shorten up some long lines */
+#define INTREG I40E_PFINT_DYN_CTLN
+
 /**
  * i40e_update_enable_itr - Update itr and re-enable MSIX interrupt
  * @vsi: the VSI we care about
@@ -1757,54 +1778,53 @@ static inline void i40e_update_enable_itr(struct 
i40e_vsi *vsi,
  struct i40e_q_vector *q_vector)
 {
struct i40e_hw *hw = &vsi->back->hw;
-   u16 old_itr;
+   bool rx = false, tx = false;
+   u32 rxval, txval;
int vector;
-   u32 val;
 
vector = (q_vector->v_idx + vsi->base_vector);
+
+   rxval = txval = i40e_buildreg_itr(I40E_ITR_NONE, 0);
+
if (ITR_IS_DYNAMIC(vsi->rx_itr_setting)) {
-   old_itr = q_vector->rx.itr;
-   i40e_set_new_dynamic_itr(&q_vector->rx);
-   if (old_itr != q_vector->rx.itr) {
-   val = I40E_PFINT_DYN_CTLN_INTENA_MASK |
-   I40E_PFINT_DYN_CTLN_CLEARPBA_MASK |
-   (I40E_RX_ITR <<
-   I40E_PFINT_DYN_CTLN_ITR_INDX_SHIFT) |
-   (q_vector->rx.itr <<
-   I40E_PFINT_DYN_CTLN_INTERVAL_SHIFT);
-   } else {
-   val = I40E_PFINT_DYN_CTLN_INTENA_MASK |
-   I40E_PFINT_DYN_CTLN_CLEARPBA_MASK |
-   (I40E_ITR_NONE <<
-   I40E_PFINT_DYN_CTLN_ITR_INDX_SHIFT);
-   }
-   if (!test_bit(__I40E_DOWN, &vsi->state))
-   wr32(hw, I40E_PFINT_DYN_CTLN(vector - 1), val);
-   } else {
-   i40e_irq_dynamic_enable(vsi, q_vector->v_idx);
+   rx = i40e_set_new_dynamic_itr(&q_vector->rx);
+   rxval = i40e_buildreg_itr(I40E_RX_ITR, q_vector->rx.itr);
}
+
if (ITR_IS_DYNAMIC(vsi->tx_itr_setting)) {
-   old_itr = q_vector->tx.itr;
-   i40e_set_new_dynamic_itr(&q_vector->tx);
-   if (old_itr != q_vector->tx.itr) {
-   val = I40E_PFINT_DYN_CTLN_INTENA_MASK |
- 

[net-next 07/17] i40e/i40evf: fix bug in throttle rate math

2015-10-16 Thread Jeff Kirsher
From: Jesse Brandeburg 

The driver was using a value expressed in 2us increments
for the divisor to figure out our bytes/usec values.

Fix the usecs variable to contain a value in microseconds.

Change-ID: I5c20493103c295d6f201947bb908add7040b7c41
Signed-off-by: Jesse Brandeburg 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   | 9 -
 drivers/net/ethernet/intel/i40evf/i40e_txrx.c | 9 -
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c 
b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 889d25d..c96e581 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -830,6 +830,7 @@ static bool i40e_set_new_dynamic_itr(struct 
i40e_ring_container *rc)
enum i40e_latency_range new_latency_range = rc->latency_range;
u32 new_itr = rc->itr;
int bytes_per_int;
+   int usecs;
 
if (rc->total_packets == 0 || !rc->itr)
return false;
@@ -838,8 +839,14 @@ static bool i40e_set_new_dynamic_itr(struct 
i40e_ring_container *rc)
 *   0-10MB/s   lowest (10 ints/s)
 *  10-20MB/s   low(2 ints/s)
 *  20-1249MB/s bulk   (8000 ints/s)
+*
+* The math works out because the divisor is in 10^(-6) which
+* turns the bytes/us input value into MB/s values, but
+* make sure to use usecs, as the register values written
+* are in 2 usec increments in the ITR registers.
 */
-   bytes_per_int = rc->total_bytes / rc->itr;
+   usecs = (rc->itr << 1);
+   bytes_per_int = rc->total_bytes / usecs;
switch (new_latency_range) {
case I40E_LOWEST_LATENCY:
if (bytes_per_int > 10)
diff --git a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c 
b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
index 597e096..7d98042 100644
--- a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
@@ -333,6 +333,7 @@ static bool i40e_set_new_dynamic_itr(struct 
i40e_ring_container *rc)
enum i40e_latency_range new_latency_range = rc->latency_range;
u32 new_itr = rc->itr;
int bytes_per_int;
+   int usecs;
 
if (rc->total_packets == 0 || !rc->itr)
return false;
@@ -341,8 +342,14 @@ static bool i40e_set_new_dynamic_itr(struct 
i40e_ring_container *rc)
 *   0-10MB/s   lowest (10 ints/s)
 *  10-20MB/s   low(2 ints/s)
 *  20-1249MB/s bulk   (8000 ints/s)
+*
+* The math works out because the divisor is in 10^(-6) which
+* turns the bytes/us input value into MB/s values, but
+* make sure to use usecs, as the register values written
+* are in 2 usec increments in the ITR registers.
 */
-   bytes_per_int = rc->total_bytes / rc->itr;
+   usecs = (rc->itr << 1);
+   bytes_per_int = rc->total_bytes / usecs;
switch (new_latency_range) {
case I40E_LOWEST_LATENCY:
if (bytes_per_int > 10)
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[net-next 14/17] i40e: Move error message to debug level

2015-10-16 Thread Jeff Kirsher
From: Catherine Sullivan 

There is an error coming back from get_phy_capabilities that does not
seem to have any functional implications. We will continue looking into
why this error message is occurring, but in the meantime, we will move it
to debug to avoid confusion.

Change-ID: I9091754bf62c066ddedeb249923d85606e2d68ed
Signed-off-by: Catherine Sullivan 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c 
b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
index 6e5f931..8fd26fd 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
@@ -751,9 +751,9 @@ static int i40e_set_settings(struct net_device *netdev,
 
status = i40e_update_link_info(hw);
if (status)
-   netdev_info(netdev, "Updating link info failed with err 
%s aq_err %s\n",
-   i40e_stat_str(hw, status),
-   i40e_aq_str(hw, hw->aq.asq_last_status));
+   netdev_dbg(netdev, "Updating link info failed with err 
%s aq_err %s\n",
+  i40e_stat_str(hw, status),
+  i40e_aq_str(hw, hw->aq.asq_last_status));
 
} else {
netdev_info(netdev, "Nothing changed, exiting without setting 
anything.\n");
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[net-next 00/17][pull request] Intel Wired LAN Driver Updates 2015-10-16

2015-10-16 Thread Jeff Kirsher
This series contains updates to i40e and i40evf only.

Kiran adds a spinlock around code accessing VSI MAC filter list to
ensure that we are synchronizing access to the filter list, otherwise
we can end up with multiple accesses at the same time which can cause
the VSI MAC filter list to get in an unstable or corrupted state.

Jesse fixes overlong BIT defines, where the RSS enabling call were
mistakenly missed.  Also fixes a bug where the enable function was
enabling the interrupt twice while trying to update the two interrupt
throttle rate thresholds for Rx and Tx, while refactoring the IRQ
enable function to simplify reading the flow.  Addressed the high
CPU utilization of some small streaming workloads that the driver should
reduce CPU in.

Anjali fixes two X722 issues with respect to EEPROM checksum verify and
reading NVM version info.  Fixed where a mask value was accidentally
replaced with a bit mask causing Flow Director sideband to be broken.

Alex Duyck fixes areas of the drivers which run from hard interrupt
context or with interrupts already disabled in netpoll, so use
napi_schedule_irqoff() instead of napi_schedule().

Mitch fixes the VF drivers to not easily give up when it is not able
to communicate with the PF driver.

Carolyn fixes a problem where our tools MAC loopback test, after driver
unbind would fail because the hardware was configured for multiqueue and
unbind operation did not clear this configuration.  Also fixed a issue
where the NVMUpdate tool gets bad data from the PHY when using the PHY
NVM feature because of contention on the MDIO interface from getting
PHY capability calls from the driver during regular operations.

Catherine fixed an issue where we were checking if autoneg was allowed
to change before checking if autoneg was changing, these checks need to
be in the reverse order.

Jean Sacren fixes up an function header comment to align the kernel-docs
with the actual code.

The following are changes since commit 4be3158abe1e02d24f82b34101e41d662fae2185:
  Merge branch 'mlxsw-spectrum'
and are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue master

Alexander Duyck (1):
  i40e/i40evf: use napi_schedule_irqoff()

Anjali Singhai (1):
  i40e: Fix basic support for X722 devices

Anjali Singhai Jain (1):
  i40e/i40evf: Fix an accidental error with BIT_ULL replacement

Carolyn Wyborny (2):
  i40e: Fix for Tools loopback test failing after driver load
  i40e: fix for PHY NVM interaction problem

Catherine Sullivan (3):
  i40e: Fix order of checks when enabling/disabling autoneg in ethtool
  i40e: Move error message to debug level
  i40e/i40evf: Bump i40e to 1.3.38 and i40evf to 1.3.25

Jean Sacren (2):
  i40e: fix kernel-doc argument name
  i40e: declare rather than initialize int object

Jesse Brandeburg (5):
  i40evf: fix overlong BIT defines
  i40e/i40evf: refactor IRQ enable function
  i40e/i40evf: fix bug in throttle rate math
  i40e/i40evf: change dynamic interrupt thresholds
  i40e/i40evf: adjust interrupt throttle less frequently

Kiran Patil (1):
  i40e: Lock for VSI's MAC filter list

Mitch Williams (1):
  i40evf: don't give up

 drivers/net/ethernet/intel/i40e/i40e.h |   4 +
 drivers/net/ethernet/intel/i40e/i40e_common.c  |  17 +-
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c |  35 ++-
 drivers/net/ethernet/intel/i40e/i40e_fcoe.c|   2 +
 drivers/net/ethernet/intel/i40e/i40e_main.c| 347 +
 drivers/net/ethernet/intel/i40e/i40e_nvm.c |  31 +-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c| 166 ++
 drivers/net/ethernet/intel/i40e/i40e_txrx.h|   7 +-
 drivers/net/ethernet/intel/i40e/i40e_type.h|   7 +-
 drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c |  26 +-
 drivers/net/ethernet/intel/i40evf/i40e_common.c|   5 -
 drivers/net/ethernet/intel/i40evf/i40e_txrx.c  | 163 ++
 drivers/net/ethernet/intel/i40evf/i40e_txrx.h  |  21 +-
 drivers/net/ethernet/intel/i40evf/i40e_type.h  |   4 +-
 drivers/net/ethernet/intel/i40evf/i40evf.h |   2 +
 drivers/net/ethernet/intel/i40evf/i40evf_main.c|  13 +-
 16 files changed, 631 insertions(+), 219 deletions(-)

-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[net-next 15/17] i40e: fix kernel-doc argument name

2015-10-16 Thread Jeff Kirsher
From: Jean Sacren 

The second argument name in the kernel-doc argument list for
i40e_features_check() was slightly off. Fix it for the kernel doc.

Signed-off-by: Jean Sacren 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 9b4fa3e..54af6d6 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -8562,7 +8562,7 @@ static int i40e_ndo_bridge_getlink(struct sk_buff *skb, 
u32 pid, u32 seq,
 /**
  * i40e_features_check - Validate encapsulated packet conforms to limits
  * @skb: skb buff
- * @netdev: This physical port's netdev
+ * @dev: This physical port's netdev
  * @features: Offload features that the stack believes apply
  **/
 static netdev_features_t i40e_features_check(struct sk_buff *skb,
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[net-next 16/17] i40e: declare rather than initialize int object

2015-10-16 Thread Jeff Kirsher
From: Jean Sacren 

'err' would be overwritten immediately, so we should declare it only
rather than initialize it to zero.

Signed-off-by: Jean Sacren 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 54af6d6..92b9d27 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -10179,7 +10179,7 @@ static int i40e_probe(struct pci_dev *pdev, const 
struct pci_device_id *ent)
static u16 pfs_found;
u16 wol_nvm_bits;
u16 link_status;
-   int err = 0;
+   int err;
u32 len;
u32 i;
 
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[net-next 03/17] i40e: Fix basic support for X722 devices

2015-10-16 Thread Jeff Kirsher
From: Anjali Singhai 

Acquire NVM, before issuing an AQ read nvm command for X722.
We need to acquire the NVM before issuing an AQ read to the NVM
otherwise we will get EBUSY from the FW. Also release when done.

This fixes the two X722 issues with respect to eeprom checksum verify
and reading NVM version info.

With this patch in place, i40e driver will provide basic support
for X722 devices.

Signed-off-by: Anjali Singhai Jain 
Acked-by: Jesse Brandeburg 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_common.c |  3 +++
 drivers/net/ethernet/intel/i40e/i40e_nvm.c| 31 +--
 drivers/net/ethernet/intel/i40e/i40e_type.h   |  3 +++
 3 files changed, 31 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_common.c 
b/drivers/net/ethernet/intel/i40e/i40e_common.c
index daa6b17..7f456e9 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_common.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_common.c
@@ -958,6 +958,9 @@ i40e_status i40e_init_shared_code(struct i40e_hw *hw)
else
hw->pf_id = (u8)(func_rid & 0x7);
 
+   if (hw->mac.type == I40E_MAC_X722)
+   hw->flags |= I40E_HW_FLAG_AQ_SRCTL_ACCESS_ENABLE;
+
status = i40e_init_nvm(hw);
return status;
 }
diff --git a/drivers/net/ethernet/intel/i40e/i40e_nvm.c 
b/drivers/net/ethernet/intel/i40e/i40e_nvm.c
index 875a8cc..58e384a 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_nvm.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_nvm.c
@@ -290,9 +290,18 @@ static i40e_status i40e_read_nvm_word_aq(struct i40e_hw 
*hw, u16 offset,
 i40e_status i40e_read_nvm_word(struct i40e_hw *hw, u16 offset,
   u16 *data)
 {
-   if (hw->mac.type == I40E_MAC_X722)
-   return i40e_read_nvm_word_aq(hw, offset, data);
-   return i40e_read_nvm_word_srctl(hw, offset, data);
+   enum i40e_status_code ret_code = 0;
+
+   if (hw->flags & I40E_HW_FLAG_AQ_SRCTL_ACCESS_ENABLE) {
+   ret_code = i40e_acquire_nvm(hw, I40E_RESOURCE_READ);
+   if (!ret_code) {
+   ret_code = i40e_read_nvm_word_aq(hw, offset, data);
+   i40e_release_nvm(hw);
+   }
+   } else {
+   ret_code = i40e_read_nvm_word_srctl(hw, offset, data);
+   }
+   return ret_code;
 }
 
 /**
@@ -397,9 +406,19 @@ read_nvm_buffer_aq_exit:
 i40e_status i40e_read_nvm_buffer(struct i40e_hw *hw, u16 offset,
 u16 *words, u16 *data)
 {
-   if (hw->mac.type == I40E_MAC_X722)
-   return i40e_read_nvm_buffer_aq(hw, offset, words, data);
-   return i40e_read_nvm_buffer_srctl(hw, offset, words, data);
+   enum i40e_status_code ret_code = 0;
+
+   if (hw->flags & I40E_HW_FLAG_AQ_SRCTL_ACCESS_ENABLE) {
+   ret_code = i40e_acquire_nvm(hw, I40E_RESOURCE_READ);
+   if (!ret_code) {
+   ret_code = i40e_read_nvm_buffer_aq(hw, offset, words,
+  data);
+   i40e_release_nvm(hw);
+   }
+   } else {
+   ret_code = i40e_read_nvm_buffer_srctl(hw, offset, words, data);
+   }
+   return ret_code;
 }
 
 /**
diff --git a/drivers/net/ethernet/intel/i40e/i40e_type.h 
b/drivers/net/ethernet/intel/i40e/i40e_type.h
index 9c4a457..f631ca0 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_type.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_type.h
@@ -544,6 +544,9 @@ struct i40e_hw {
struct i40e_dcbx_config remote_dcbx_config; /* Peer Cfg */
struct i40e_dcbx_config desired_dcbx_config; /* CEE Desired Cfg */
 
+#define I40E_HW_FLAG_AQ_SRCTL_ACCESS_ENABLE BIT_ULL(0)
+   u64 flags;
+
/* debug mask */
u32 debug_mask;
char err_str[16];
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[net-next 12/17] i40e/i40evf: Fix an accidental error with BIT_ULL replacement

2015-10-16 Thread Jeff Kirsher
From: Anjali Singhai Jain 

A mask value of 0x1FF was accidentally replaced with a bit mask
causing flow director sideband to be broken.

Change-ID: Id3387f67dd1b567b41692b570b383c58671e1eae
Signed-off-by: Anjali Singhai Jain 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_type.h   | 4 ++--
 drivers/net/ethernet/intel/i40evf/i40e_type.h | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_type.h 
b/drivers/net/ethernet/intel/i40e/i40e_type.h
index f631ca0..dd2da35 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_type.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_type.h
@@ -1068,8 +1068,8 @@ enum i40e_filter_program_desc_fd_status {
 };
 
 #define I40E_TXD_FLTR_QW0_DEST_VSI_SHIFT   23
-#define I40E_TXD_FLTR_QW0_DEST_VSI_MASK \
-  BIT_ULL(I40E_TXD_FLTR_QW0_DEST_VSI_SHIFT)
+#define I40E_TXD_FLTR_QW0_DEST_VSI_MASK(0x1FFUL << \
+I40E_TXD_FLTR_QW0_DEST_VSI_SHIFT)
 
 #define I40E_TXD_FLTR_QW1_CMD_SHIFT4
 #define I40E_TXD_FLTR_QW1_CMD_MASK (0xULL << \
diff --git a/drivers/net/ethernet/intel/i40evf/i40e_type.h 
b/drivers/net/ethernet/intel/i40evf/i40e_type.h
index 85af3b4..301fe2b 100644
--- a/drivers/net/ethernet/intel/i40evf/i40e_type.h
+++ b/drivers/net/ethernet/intel/i40evf/i40e_type.h
@@ -1055,8 +1055,8 @@ enum i40e_filter_program_desc_fd_status {
 };
 
 #define I40E_TXD_FLTR_QW0_DEST_VSI_SHIFT   23
-#define I40E_TXD_FLTR_QW0_DEST_VSI_MASK \
-  BIT_ULL(I40E_TXD_FLTR_QW0_DEST_VSI_SHIFT)
+#define I40E_TXD_FLTR_QW0_DEST_VSI_MASK(0x1FFUL << \
+I40E_TXD_FLTR_QW0_DEST_VSI_SHIFT)
 
 #define I40E_TXD_FLTR_QW1_CMD_SHIFT4
 #define I40E_TXD_FLTR_QW1_CMD_MASK (0xULL << \
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[net-next 01/17] i40e: Lock for VSI's MAC filter list

2015-10-16 Thread Jeff Kirsher
From: Kiran Patil 

This patch introduces a spinlock which is to be used for synchronizing
access to VSI's MAC filter list.

This patch also synchronizes execution of other codepaths which are
accessing VSI's MAC filter list with execution of
service_task:sync_vsi_filters.

In function i40e_add_vsi, copied out LAA MAC address instead of cloning
MAC filter entry because only MAC address is needed to remove MAC VLAN
filter from FW/HW.

Change-ID: I0e10ac7c715d44aa994239642aa4d57c998573a2
Signed-off-by: Kiran Patil 
Signed-off-by: Jesse Brandeburg 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e.h |   2 +
 drivers/net/ethernet/intel/i40e/i40e_fcoe.c|   2 +
 drivers/net/ethernet/intel/i40e/i40e_main.c| 328 +
 drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c |  26 +-
 4 files changed, 302 insertions(+), 56 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e.h 
b/drivers/net/ethernet/intel/i40e/i40e.h
index 7a58b1f..b7818be 100644
--- a/drivers/net/ethernet/intel/i40e/i40e.h
+++ b/drivers/net/ethernet/intel/i40e/i40e.h
@@ -468,6 +468,8 @@ struct i40e_vsi {
 #define I40E_VSI_FLAG_VEB_OWNERBIT(1)
unsigned long flags;
 
+   /* Per VSI lock to protect elements/list (MAC filter) */
+   spinlock_t mac_filter_list_lock;
struct list_head mac_filter_list;
 
/* VSI stats */
diff --git a/drivers/net/ethernet/intel/i40e/i40e_fcoe.c 
b/drivers/net/ethernet/intel/i40e/i40e_fcoe.c
index 2ec2411..fe5d9bf 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_fcoe.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_fcoe.c
@@ -1516,10 +1516,12 @@ void i40e_fcoe_config_netdev(struct net_device *netdev, 
struct i40e_vsi *vsi)
 * same PCI function.
 */
netdev->dev_port = 1;
+   spin_lock_bh(&vsi->mac_filter_list_lock);
i40e_add_filter(vsi, hw->mac.san_addr, 0, false, false);
i40e_add_filter(vsi, (u8[6]) FC_FCOE_FLOGI_MAC, 0, false, false);
i40e_add_filter(vsi, FIP_ALL_FCOE_MACS, 0, false, false);
i40e_add_filter(vsi, FIP_ALL_ENODE_MACS, 0, false, false);
+   spin_unlock_bh(&vsi->mac_filter_list_lock);
 
/* use san mac */
ether_addr_copy(netdev->dev_addr, hw->mac.san_addr);
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 87a5d09..aa2aafd 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -1223,6 +1223,9 @@ static struct i40e_mac_filter *i40e_find_filter(struct 
i40e_vsi *vsi,
 {
struct i40e_mac_filter *f;
 
+   WARN(!spin_is_locked(&vsi->mac_filter_list_lock),
+"Missing mac_filter_list_lock\n");
+
if (!vsi || !macaddr)
return NULL;
 
@@ -1251,6 +1254,9 @@ struct i40e_mac_filter *i40e_find_mac(struct i40e_vsi 
*vsi, u8 *macaddr,
 {
struct i40e_mac_filter *f;
 
+   WARN(!spin_is_locked(&vsi->mac_filter_list_lock),
+"Missing mac_filter_list_lock\n");
+
if (!vsi || !macaddr)
return NULL;
 
@@ -1273,6 +1279,9 @@ bool i40e_is_vsi_in_vlan(struct i40e_vsi *vsi)
 {
struct i40e_mac_filter *f;
 
+   WARN(!spin_is_locked(&vsi->mac_filter_list_lock),
+"Missing mac_filter_list_lock\n");
+
/* Only -1 for all the filters denotes not in vlan mode
 * so we have to go through all the list in order to make sure
 */
@@ -1301,6 +1310,9 @@ struct i40e_mac_filter *i40e_put_mac_in_vlan(struct 
i40e_vsi *vsi, u8 *macaddr,
 {
struct i40e_mac_filter *f;
 
+   WARN(!spin_is_locked(&vsi->mac_filter_list_lock),
+"Missing mac_filter_list_lock\n");
+
list_for_each_entry(f, &vsi->mac_filter_list, list) {
if (vsi->info.pvid)
f->vlan = le16_to_cpu(vsi->info.pvid);
@@ -1355,6 +1367,10 @@ static int i40e_rm_default_mac_filter(struct i40e_vsi 
*vsi, u8 *macaddr)
  * @is_netdev: make sure its a netdev filter, else doesn't matter
  *
  * Returns ptr to the filter object or NULL when no memory available.
+ *
+ * NOTE: This function is expected to be called with mac_filter_list_lock
+ * being held. If needed could add WARN/BUG_ON if lock is not held for debug
+ * purpose.
  **/
 struct i40e_mac_filter *i40e_add_filter(struct i40e_vsi *vsi,
u8 *macaddr, s16 vlan,
@@ -1362,6 +1378,9 @@ struct i40e_mac_filter *i40e_add_filter(struct i40e_vsi 
*vsi,
 {
struct i40e_mac_filter *f;
 
+   WARN(!spin_is_locked(&vsi->mac_filter_list_lock),
+"Missing mac_filter_list_lock\n");
+
if (!vsi || !macaddr)
return NULL;
 
@@ -1413,6 +1432,10 @@ add_filter_out:
  * @vlan: the vlan
  * @is_vf: make sure it's a VF filter, else doesn't matter
  * @is_netdev: make sure it's a netdev filter, else doesn't matter
+ *
+ * NOTE: This function is expected to be call

[net-next 17/17] i40e/i40evf: Bump i40e to 1.3.38 and i40evf to 1.3.25

2015-10-16 Thread Jeff Kirsher
From: Catherine Sullivan 

Bump.

Change-ID: Id0a7ecaa491f88ce94c9eba4901e592a56044ee0
Signed-off-by: Catherine Sullivan 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 2 +-
 drivers/net/ethernet/intel/i40evf/i40evf_main.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 92b9d27..7af52d3 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -39,7 +39,7 @@ static const char i40e_driver_string[] =
 
 #define DRV_VERSION_MAJOR 1
 #define DRV_VERSION_MINOR 3
-#define DRV_VERSION_BUILD 34
+#define DRV_VERSION_BUILD 38
 #define DRV_VERSION __stringify(DRV_VERSION_MAJOR) "." \
 __stringify(DRV_VERSION_MINOR) "." \
 __stringify(DRV_VERSION_BUILD)DRV_KERN
diff --git a/drivers/net/ethernet/intel/i40evf/i40evf_main.c 
b/drivers/net/ethernet/intel/i40evf/i40evf_main.c
index a44a42b..10b2a4a 100644
--- a/drivers/net/ethernet/intel/i40evf/i40evf_main.c
+++ b/drivers/net/ethernet/intel/i40evf/i40evf_main.c
@@ -34,7 +34,7 @@ char i40evf_driver_name[] = "i40evf";
 static const char i40evf_driver_string[] =
"Intel(R) XL710/X710 Virtual Function Network Driver";
 
-#define DRV_VERSION "1.3.21"
+#define DRV_VERSION "1.3.25"
 const char i40evf_driver_version[] = DRV_VERSION;
 static const char i40evf_copyright[] =
"Copyright (c) 2013 - 2015 Intel Corporation.";
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[net-next 08/17] i40e/i40evf: change dynamic interrupt thresholds

2015-10-16 Thread Jeff Kirsher
From: Jesse Brandeburg 

The dynamic algorithm, while now working, doesn't have good
performance in 40G mode.

One part of this patch addresses the high CPU utilization of some small
streaming workloads that the driver should reduce CPU in.

It also changes the minimum ITR that the dynamic algorithm
will settle on, causing our minimum latency to go from 12us
to about 14us, when using adaptive mode.

It also changes the BULK interrupt rate to allow maximum throughput
on a 40Gb connection with a single thread of transmit, clamping
interrupt rate to 8000 for TX makes single thread traffic go too
slow.

The new ULTRA bulk setting is introduced and is used
when the Rx packet rate on this queue exceeds 4 packets per
second.  This value of 4 was chosen because the automatic tuning
of minimum ITR=20us means that a single queue can't quite achieve
that many packets per second from a round-robin test.

Change-ID: Icce8faa128688ca5fd2c4229bdd9726877a92ea2
Signed-off-by: Jesse Brandeburg 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   | 27 +--
 drivers/net/ethernet/intel/i40e/i40e_txrx.h   |  3 +++
 drivers/net/ethernet/intel/i40evf/i40e_txrx.c | 27 +--
 drivers/net/ethernet/intel/i40evf/i40e_txrx.h |  3 +++
 4 files changed, 48 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c 
b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index c96e581..7aea143 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -828,6 +828,7 @@ void i40e_force_wb(struct i40e_vsi *vsi, struct 
i40e_q_vector *q_vector)
 static bool i40e_set_new_dynamic_itr(struct i40e_ring_container *rc)
 {
enum i40e_latency_range new_latency_range = rc->latency_range;
+   struct i40e_q_vector *qv = rc->ring->q_vector;
u32 new_itr = rc->itr;
int bytes_per_int;
int usecs;
@@ -836,9 +837,10 @@ static bool i40e_set_new_dynamic_itr(struct 
i40e_ring_container *rc)
return false;
 
/* simple throttlerate management
-*   0-10MB/s   lowest (10 ints/s)
+*   0-10MB/s   lowest (5 ints/s)
 *  10-20MB/s   low(2 ints/s)
-*  20-1249MB/s bulk   (8000 ints/s)
+*  20-1249MB/s bulk   (18000 ints/s)
+*  > 4 Rx packets per second (8000 ints/s)
 *
 * The math works out because the divisor is in 10^(-6) which
 * turns the bytes/us input value into MB/s values, but
@@ -859,24 +861,37 @@ static bool i40e_set_new_dynamic_itr(struct 
i40e_ring_container *rc)
new_latency_range = I40E_LOWEST_LATENCY;
break;
case I40E_BULK_LATENCY:
-   if (bytes_per_int <= 20)
-   new_latency_range = I40E_LOW_LATENCY;
-   break;
+   case I40E_ULTRA_LATENCY:
default:
if (bytes_per_int <= 20)
new_latency_range = I40E_LOW_LATENCY;
break;
}
+
+   /* this is to adjust RX more aggressively when streaming small
+* packets.  The value of 4 was picked as it is just beyond
+* what the hardware can receive per second if in low latency
+* mode.
+*/
+#define RX_ULTRA_PACKET_RATE 4
+
+   if rc->total_packets * 100) / usecs) > RX_ULTRA_PACKET_RATE) &&
+   (&qv->rx == rc))
+   new_latency_range = I40E_ULTRA_LATENCY;
+
rc->latency_range = new_latency_range;
 
switch (new_latency_range) {
case I40E_LOWEST_LATENCY:
-   new_itr = I40E_ITR_100K;
+   new_itr = I40E_ITR_50K;
break;
case I40E_LOW_LATENCY:
new_itr = I40E_ITR_20K;
break;
case I40E_BULK_LATENCY:
+   new_itr = I40E_ITR_18K;
+   break;
+   case I40E_ULTRA_LATENCY:
new_itr = I40E_ITR_8K;
break;
default:
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h 
b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
index 7c0ed84..0fe7eb7 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
@@ -32,7 +32,9 @@
 #define I40E_MAX_ITR   0x0FF0  /* reg uses 2 usec resolution */
 #define I40E_MIN_ITR   0x0001  /* reg uses 2 usec resolution */
 #define I40E_ITR_100K  0x0005
+#define I40E_ITR_50K   0x000A
 #define I40E_ITR_20K   0x0019
+#define I40E_ITR_18K   0x001B
 #define I40E_ITR_8K0x003E
 #define I40E_ITR_4K0x007A
 #define I40E_MAX_INTRL 0x3B/* reg uses 4 usec resolution */
@@ -296,6 +298,7 @@ enum i40e_latency_range {
I40E_LOWEST_LATENCY = 0,
I40E_LOW_LATENCY = 1,
I40E_BULK_LATENCY = 2,
+   I40E_ULTRA_LATENCY = 3,
 };

[net-next 10/17] i40e: Fix for Tools loopback test failing after driver load

2015-10-16 Thread Jeff Kirsher
From: Carolyn Wyborny 

This patch fixes a problem where our Tools MAC Loopback test, after
driver unbind would fail.  This was because the hw was configured
for multiqueue and unbind operation did not clear this configuration.
The problem is fixed by resetting this configuration in i40e_remove.

Change-ID: I130c05138319182ed1476d3a0b5222d6a6320af9
Signed-off-by: Carolyn Wyborny 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index b105da1..9b4fa3e 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -10650,6 +10650,7 @@ err_dma:
 static void i40e_remove(struct pci_dev *pdev)
 {
struct i40e_pf *pf = pci_get_drvdata(pdev);
+   struct i40e_hw *hw = &pf->hw;
i40e_status ret_code;
int i;
 
@@ -10657,6 +10658,10 @@ static void i40e_remove(struct pci_dev *pdev)
 
i40e_ptp_stop(pf);
 
+   /* Disable RSS in hw */
+   wr32(hw, I40E_PFQF_HENA(0), 0);
+   wr32(hw, I40E_PFQF_HENA(1), 0);
+
/* no more scheduling of any task */
set_bit(__I40E_DOWN, &pf->state);
del_timer_sync(&pf->service_timer);
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[net-next 09/17] i40e/i40evf: adjust interrupt throttle less frequently

2015-10-16 Thread Jeff Kirsher
From: Jesse Brandeburg 

The adaptive ITR (interrupt throttle rate) algorithm was adjusting
the hardware's interrupt rate too frequently.  This caused a lot
of variation in the interrupt rate for fairly constant workloads.

Change the code to have a counter and adjust only once every N
number of interrupts.

Change-ID: I0460f1f86571037484eca5aca36ac4d889cb8389
Signed-off-by: Jesse Brandeburg 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e.h  |  2 ++
 drivers/net/ethernet/intel/i40e/i40e_main.c |  2 ++
 drivers/net/ethernet/intel/i40e/i40e_txrx.c | 22 --
 drivers/net/ethernet/intel/i40e/i40e_txrx.h |  4 ++--
 drivers/net/ethernet/intel/i40evf/i40e_txrx.c   | 23 +--
 drivers/net/ethernet/intel/i40evf/i40e_txrx.h   |  4 ++--
 drivers/net/ethernet/intel/i40evf/i40evf.h  |  2 ++
 drivers/net/ethernet/intel/i40evf/i40evf_main.c |  2 ++
 8 files changed, 53 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e.h 
b/drivers/net/ethernet/intel/i40e/i40e.h
index b7818be..7c2b2e8 100644
--- a/drivers/net/ethernet/intel/i40e/i40e.h
+++ b/drivers/net/ethernet/intel/i40e/i40e.h
@@ -577,6 +577,8 @@ struct i40e_q_vector {
struct rcu_head rcu;/* to avoid race with update stats on free */
char name[I40E_INT_NAME_STR_LEN];
bool arm_wb_state;
+#define ITR_COUNTDOWN_START 100
+   u8 itr_countdown;   /* when 0 should adjust ITR */
 } cacheline_internodealigned_in_smp;
 
 /* lan device */
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 11dbb29..b105da1 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -3107,6 +3107,7 @@ static void i40e_vsi_configure_msix(struct i40e_vsi *vsi)
for (i = 0; i < vsi->num_q_vectors; i++, vector++) {
struct i40e_q_vector *q_vector = vsi->q_vectors[i];
 
+   q_vector->itr_countdown = ITR_COUNTDOWN_START;
q_vector->rx.itr = ITR_TO_REG(vsi->rx_itr_setting);
q_vector->rx.latency_range = I40E_LOW_LATENCY;
wr32(hw, I40E_PFINT_ITRN(I40E_RX_ITR, vector - 1),
@@ -3202,6 +3203,7 @@ static void i40e_configure_msi_and_legacy(struct i40e_vsi 
*vsi)
u32 val;
 
/* set the ITR configuration */
+   q_vector->itr_countdown = ITR_COUNTDOWN_START;
q_vector->rx.itr = ITR_TO_REG(vsi->rx_itr_setting);
q_vector->rx.latency_range = I40E_LOW_LATENCY;
wr32(hw, I40E_PFINT_ITR0(I40E_RX_ITR), q_vector->rx.itr);
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c 
b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 7aea143..006f0fb 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -845,10 +845,12 @@ static bool i40e_set_new_dynamic_itr(struct 
i40e_ring_container *rc)
 * The math works out because the divisor is in 10^(-6) which
 * turns the bytes/us input value into MB/s values, but
 * make sure to use usecs, as the register values written
-* are in 2 usec increments in the ITR registers.
+* are in 2 usec increments in the ITR registers, and make sure
+* to use the smoothed values that the countdown timer gives us.
 */
-   usecs = (rc->itr << 1);
+   usecs = (rc->itr << 1) * ITR_COUNTDOWN_START;
bytes_per_int = rc->total_bytes / usecs;
+
switch (new_latency_range) {
case I40E_LOWEST_LATENCY:
if (bytes_per_int > 10)
@@ -1806,8 +1808,17 @@ static inline void i40e_update_enable_itr(struct 
i40e_vsi *vsi,
 
vector = (q_vector->v_idx + vsi->base_vector);
 
+   /* avoid dynamic calculation if in countdown mode OR if
+* all dynamic is disabled
+*/
rxval = txval = i40e_buildreg_itr(I40E_ITR_NONE, 0);
 
+   if (q_vector->itr_countdown > 0 ||
+   (!ITR_IS_DYNAMIC(vsi->rx_itr_setting) &&
+!ITR_IS_DYNAMIC(vsi->tx_itr_setting))) {
+   goto enable_int;
+   }
+
if (ITR_IS_DYNAMIC(vsi->rx_itr_setting)) {
rx = i40e_set_new_dynamic_itr(&q_vector->rx);
rxval = i40e_buildreg_itr(I40E_RX_ITR, q_vector->rx.itr);
@@ -1845,8 +1856,15 @@ static inline void i40e_update_enable_itr(struct 
i40e_vsi *vsi,
wr32(hw, INTREG(vector - 1), rxval);
}
 
+enable_int:
if (!test_bit(__I40E_DOWN, &vsi->state))
wr32(hw, INTREG(vector - 1), txval);
+
+   if (q_vector->itr_countdown)
+   q_vector->itr_countdown--;
+   else
+   q_vector->itr_countdown = ITR_COUNTDOWN_START;
+
 }
 
 /**
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h 
b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
index 0fe7eb7..6779fb7 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
+++ b/drivers/net/ethernet/

[net-next 02/17] i40evf: fix overlong BIT defines

2015-10-16 Thread Jeff Kirsher
From: Jesse Brandeburg 

The defines from the RSS enabling call were mistakenly
missed in the patches to the i40e which should have been
to i40evf as well.

This is a follow up to (commit ed921559886dd40528) "fix
32 bit build warnings".

Signed-off-by: Jesse Brandeburg 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40evf/i40e_common.c |  5 -
 drivers/net/ethernet/intel/i40evf/i40e_txrx.h   | 14 +++---
 2 files changed, 7 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40evf/i40e_common.c 
b/drivers/net/ethernet/intel/i40evf/i40e_common.c
index 7435fb3..a4c5a49 100644
--- a/drivers/net/ethernet/intel/i40evf/i40e_common.c
+++ b/drivers/net/ethernet/intel/i40evf/i40e_common.c
@@ -443,9 +443,6 @@ static i40e_status i40e_aq_get_set_rss_lut(struct i40e_hw 
*hw,
I40E_AQC_SET_RSS_LUT_TABLE_TYPE_SHIFT) &
I40E_AQC_SET_RSS_LUT_TABLE_TYPE_MASK));
 
-   cmd_resp->addr_high = cpu_to_le32(high_16_bits((u64)lut));
-   cmd_resp->addr_low = cpu_to_le32(lower_32_bits((u64)lut));
-
status = i40evf_asq_send_command(hw, &desc, lut, lut_size, NULL);
 
return status;
@@ -520,8 +517,6 @@ static i40e_status i40e_aq_get_set_rss_key(struct i40e_hw 
*hw,
  I40E_AQC_SET_RSS_KEY_VSI_ID_SHIFT) &
  I40E_AQC_SET_RSS_KEY_VSI_ID_MASK));
cmd_resp->vsi_id |= cpu_to_le16((u16)I40E_AQC_SET_RSS_KEY_VSI_VALID);
-   cmd_resp->addr_high = cpu_to_le32(high_16_bits((u64)key));
-   cmd_resp->addr_low = cpu_to_le32(lower_32_bits((u64)key));
 
status = i40evf_asq_send_command(hw, &desc, key, key_size, NULL);
 
diff --git a/drivers/net/ethernet/intel/i40evf/i40e_txrx.h 
b/drivers/net/ethernet/intel/i40evf/i40e_txrx.h
index c4f5a4e..03275eb 100644
--- a/drivers/net/ethernet/intel/i40evf/i40e_txrx.h
+++ b/drivers/net/ethernet/intel/i40evf/i40e_txrx.h
@@ -89,16 +89,16 @@ enum i40e_dyn_idx_t {
BIT_ULL(I40E_FILTER_PCTYPE_L2_PAYLOAD))
 
 #define I40E_DEFAULT_RSS_HENA_EXPANDED (I40E_DEFAULT_RSS_HENA | \
-   BIT(I40E_FILTER_PCTYPE_NONF_IPV4_TCP_SYN_NO_ACK) | \
-   BIT(I40E_FILTER_PCTYPE_NONF_UNICAST_IPV4_UDP) | \
-   BIT(I40E_FILTER_PCTYPE_NONF_MULTICAST_IPV4_UDP) | \
-   BIT(I40E_FILTER_PCTYPE_NONF_IPV6_TCP_SYN_NO_ACK) | \
-   BIT(I40E_FILTER_PCTYPE_NONF_UNICAST_IPV6_UDP) | \
-   BIT(I40E_FILTER_PCTYPE_NONF_MULTICAST_IPV6_UDP))
+   BIT_ULL(I40E_FILTER_PCTYPE_NONF_IPV4_TCP_SYN_NO_ACK) | \
+   BIT_ULL(I40E_FILTER_PCTYPE_NONF_UNICAST_IPV4_UDP) | \
+   BIT_ULL(I40E_FILTER_PCTYPE_NONF_MULTICAST_IPV4_UDP) | \
+   BIT_ULL(I40E_FILTER_PCTYPE_NONF_IPV6_TCP_SYN_NO_ACK) | \
+   BIT_ULL(I40E_FILTER_PCTYPE_NONF_UNICAST_IPV6_UDP) | \
+   BIT_ULL(I40E_FILTER_PCTYPE_NONF_MULTICAST_IPV6_UDP))
 
 #define i40e_pf_get_default_rss_hena(pf) \
(((pf)->flags & I40E_FLAG_MULTIPLE_TCP_UDP_RSS_PCTYPE) ? \
-   I40E_DEFAULT_RSS_HENA_EXPANDED : I40E_DEFAULT_RSS_HENA)
+ I40E_DEFAULT_RSS_HENA_EXPANDED : I40E_DEFAULT_RSS_HENA)
 
 /* Supported Rx Buffer Sizes */
 #define I40E_RXBUFFER_512   512/* Used for packet split */
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[net-next 13/17] i40e: Fix order of checks when enabling/disabling autoneg in ethtool

2015-10-16 Thread Jeff Kirsher
From: Catherine Sullivan 

We were previously checking if autoneg was allowed to change before
checking if autoneg was changing. We need to do this in the other order
or else we will erroneously return EINVAL when autoneg is not changing.

Change-ID: Iff9f7d1c9bddc1ad1e5d227d4f42754f90155410
Signed-off-by: Catherine Sullivan 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c | 29 ++
 1 file changed, 16 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c 
b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
index 767b1db..6e5f931 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
@@ -661,28 +661,31 @@ static int i40e_set_settings(struct net_device *netdev,
 
/* Check autoneg */
if (autoneg == AUTONEG_ENABLE) {
-   /* If autoneg is not supported, return error */
-   if (!(safe_ecmd.supported & SUPPORTED_Autoneg)) {
-   netdev_info(netdev, "Autoneg not supported on this 
phy\n");
-   return -EINVAL;
-   }
/* If autoneg was not already enabled */
if (!(hw->phy.link_info.an_info & I40E_AQ_AN_COMPLETED)) {
+   /* If autoneg is not supported, return error */
+   if (!(safe_ecmd.supported & SUPPORTED_Autoneg)) {
+   netdev_info(netdev, "Autoneg not supported on 
this phy\n");
+   return -EINVAL;
+   }
+   /* Autoneg is allowed to change */
config.abilities = abilities.abilities |
   I40E_AQ_PHY_ENABLE_AN;
change = true;
}
} else {
-   /* If autoneg is supported 10GBASE_T is the only phy that
-* can disable it, so otherwise return error
-*/
-   if (safe_ecmd.supported & SUPPORTED_Autoneg &&
-   hw->phy.link_info.phy_type != I40E_PHY_TYPE_10GBASE_T) {
-   netdev_info(netdev, "Autoneg cannot be disabled on this 
phy\n");
-   return -EINVAL;
-   }
/* If autoneg is currently enabled */
if (hw->phy.link_info.an_info & I40E_AQ_AN_COMPLETED) {
+   /* If autoneg is supported 10GBASE_T is the only PHY
+* that can disable it, so otherwise return error
+*/
+   if (safe_ecmd.supported & SUPPORTED_Autoneg &&
+   hw->phy.link_info.phy_type !=
+   I40E_PHY_TYPE_10GBASE_T) {
+   netdev_info(netdev, "Autoneg cannot be disabled 
on this phy\n");
+   return -EINVAL;
+   }
+   /* Autoneg is allowed to change */
config.abilities = abilities.abilities &
   ~I40E_AQ_PHY_ENABLE_AN;
change = true;
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[net-next 04/17] i40e/i40evf: use napi_schedule_irqoff()

2015-10-16 Thread Jeff Kirsher
From: Alexander Duyck 

The i40e_intr and i40e/i40evf_msix_clean_rings functions run from hard
interrupt context or with interrupts already disabled in netpoll.

They can use napi_schedule_irqoff() instead of napi_schedule()

Signed-off-by: Alexander Duyck 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 6 --
 drivers/net/ethernet/intel/i40evf/i40evf_main.c | 2 +-
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index aa2aafd..11dbb29 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -3287,7 +3287,7 @@ static irqreturn_t i40e_msix_clean_rings(int irq, void 
*data)
if (!q_vector->tx.ring && !q_vector->rx.ring)
return IRQ_HANDLED;
 
-   napi_schedule(&q_vector->napi);
+   napi_schedule_irqoff(&q_vector->napi);
 
return IRQ_HANDLED;
 }
@@ -3456,6 +3456,8 @@ static irqreturn_t i40e_intr(int irq, void *data)
 
/* only q0 is used in MSI/Legacy mode, and none are used in MSIX */
if (icr0 & I40E_PFINT_ICR0_QUEUE_0_MASK) {
+   struct i40e_vsi *vsi = pf->vsi[pf->lan_vsi];
+   struct i40e_q_vector *q_vector = vsi->q_vectors[0];
 
/* temporarily disable queue cause for NAPI processing */
u32 qval = rd32(hw, I40E_QINT_RQCTL(0));
@@ -3468,7 +3470,7 @@ static irqreturn_t i40e_intr(int irq, void *data)
wr32(hw, I40E_QINT_TQCTL(0), qval);
 
if (!test_bit(__I40E_DOWN, &pf->state))
-   
napi_schedule(&pf->vsi[pf->lan_vsi]->q_vectors[0]->napi);
+   napi_schedule_irqoff(&q_vector->napi);
}
 
if (icr0 & I40E_PFINT_ICR0_ADMINQ_MASK) {
diff --git a/drivers/net/ethernet/intel/i40evf/i40evf_main.c 
b/drivers/net/ethernet/intel/i40evf/i40evf_main.c
index 4c4340c..1dd5245 100644
--- a/drivers/net/ethernet/intel/i40evf/i40evf_main.c
+++ b/drivers/net/ethernet/intel/i40evf/i40evf_main.c
@@ -334,7 +334,7 @@ static irqreturn_t i40evf_msix_clean_rings(int irq, void 
*data)
if (!q_vector->tx.ring && !q_vector->rx.ring)
return IRQ_HANDLED;
 
-   napi_schedule(&q_vector->napi);
+   napi_schedule_irqoff(&q_vector->napi);
 
return IRQ_HANDLED;
 }
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[net-next 11/17] i40e: fix for PHY NVM interaction problem

2015-10-16 Thread Jeff Kirsher
From: Carolyn Wyborny 

This patch fixes a problem where the NVMUpdate Tool, when using the PHY
NVM feature, gets bad data from the PHY because of contention on the
MDIO interface from get PHY capability calls from the driver during
regular operations.  The problem is fixed by adding a check if media
is available before calling get PHY capability function because that
bit is not set when device is in PHY interaction mode.

Change-ID: Ib89991b0f841808dd92410f5e8683d6ee3301cd0
Signed-off-by: Carolyn Wyborny 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_common.c | 14 --
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_common.c 
b/drivers/net/ethernet/intel/i40e/i40e_common.c
index 7f456e9..c51f2fb 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_common.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_common.c
@@ -2278,13 +2278,15 @@ i40e_status i40e_update_link_info(struct i40e_hw *hw)
if (status)
return status;
 
-   status = i40e_aq_get_phy_capabilities(hw, false, false, &abilities,
- NULL);
-   if (status)
-   return status;
+   if (hw->phy.link_info.link_info & I40E_AQ_MEDIA_AVAILABLE) {
+   status = i40e_aq_get_phy_capabilities(hw, false, false,
+ &abilities, NULL);
+   if (status)
+   return status;
 
-   memcpy(hw->phy.link_info.module_type, &abilities.module_type,
-  sizeof(hw->phy.link_info.module_type));
+   memcpy(hw->phy.link_info.module_type, &abilities.module_type,
+  sizeof(hw->phy.link_info.module_type));
+   }
 
return status;
 }
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


pull-request: wireless-drivers 2015-10-17

2015-10-16 Thread Kalle Valo
Hi Dave,

few small fixes I would like to get to 4.3 still. Please let me know if
there are any problems.

Kalle

The following changes since commit 76d164f582150fd0259ec0fcbc485470bcd8033e:

  ath10k: fix DMA related firmware crashes on multiple devices (2015-09-26 
20:48:25 +0300)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers.git 
tags/wireless-drivers-for-davem-2015-10-17

for you to fetch changes up to de28a05ee28e36fa1fc68fcfc7c6cc1c6f7c270c:

  Merge tag 'iwlwifi-for-kalle-2015-10-05' of 
git://git.kernel.org/pub/scm/linux/kernel/git/iwlwifi/iwlwifi-fixes (2015-10-07 
11:12:01 +0300)



iwlwifi:

* mvm: flush fw_dump_wk when mvm fails to start
* mvm: init card correctly on ctkill exit check
* pci: add a few more PCI subvendor IDs for the 7265 series
* fix firmware filename for 3160
* mvm: clear csa countdown when AP is stopped
* mvm: fix D3 firmware PN programming
* dvm: fix D3 firmware PN programming
* mvm: fix D3 CCMP TX PN assignment

rtlwifi:

* rtl8821ae: Fix system lockups on boot


Andrei Otcheretianski (1):
  iwlwifi: mvm: flush fw_dump_wk when mvm fails to start

Arik Nemtsov (1):
  iwlwifi: mvm: init card correctly on ctkill exit check

Avraham Stern (1):
  iwlwifi: mvm: clear csa countdown when AP is stopped

Johannes Berg (4):
  iwlwifi: mvm: fix D3 CCMP TX PN assignment
  iwlwifi: dvm: fix D3 firmware PN programming
  iwlwifi: mvm: fix D3 firmware PN programming
  iwlwifi: fix firmware filename for 3160

Kalle Valo (1):
  Merge tag 'iwlwifi-for-kalle-2015-10-05' of 
git://git.kernel.org/.../iwlwifi/iwlwifi-fixes

Larry Finger (1):
  rtlwifi: rtl8821ae: Fix system lockups on boot

Luca Coelho (1):
  iwlwifi: pci: add a few more PCI subvendor IDs for the 7265 series

 drivers/net/wireless/iwlwifi/dvm/lib.c  |2 +-
 drivers/net/wireless/iwlwifi/iwl-7000.c |2 +-
 drivers/net/wireless/iwlwifi/mvm/d3.c   |   27 +++
 drivers/net/wireless/iwlwifi/mvm/fw.c   |4 ++--
 drivers/net/wireless/iwlwifi/mvm/mac80211.c |1 +
 drivers/net/wireless/iwlwifi/mvm/mvm.h  |5 +
 drivers/net/wireless/iwlwifi/mvm/ops.c  |1 +
 drivers/net/wireless/iwlwifi/pcie/drv.c |5 +
 drivers/net/wireless/rtlwifi/pci.h  |2 ++
 drivers/net/wireless/rtlwifi/rtl8821ae/hw.c |   17 +
 drivers/net/wireless/rtlwifi/rtl8821ae/sw.c |5 +
 drivers/net/wireless/rtlwifi/wifi.h |3 +++
 12 files changed, 54 insertions(+), 20 deletions(-)

-- 
Kalle Valo
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net-next] uapi: add mpls_iptunnel.h

2015-10-16 Thread roopa
On 10/16/15, 4:12 PM, Stephen Hemminger wrote:
> Add missing rule to export mpls iptunnel header needed by iproute2
>
> Signed-off-by: Stephen Hemminger 
Acked-by: Roopa Prabhu 

Thanks Stephen.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH net-next 5/7] tcp: skb_mstamp_after helper

2015-10-16 Thread Yuchung Cheng
a helper to prepare the first main RACK patch.

Signed-off-by: Yuchung Cheng 
Signed-off-by: Neal Cardwell 
Signed-off-by: Eric Dumazet 
---
 include/linux/skbuff.h | 9 +
 1 file changed, 9 insertions(+)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 4398411..24f4dfd 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -463,6 +463,15 @@ static inline u32 skb_mstamp_us_delta(const struct 
skb_mstamp *t1,
return delta_us;
 }
 
+static inline bool skb_mstamp_after(const struct skb_mstamp *t1,
+   const struct skb_mstamp *t0)
+{
+   s32 diff = t1->stamp_jiffies - t0->stamp_jiffies;
+
+   if (!diff)
+   diff = t1->stamp_us - t0->stamp_us;
+   return diff > 0;
+}
 
 /** 
  * struct sk_buff - socket buffer
-- 
2.6.0.rc2.230.g3dd15c0

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH net-next 4/7] tcp: add tcp_tsopt_ecr_before helper

2015-10-16 Thread Yuchung Cheng
a helper to prepare the main RACK patch

Signed-off-by: Yuchung Cheng 
Signed-off-by: Neal Cardwell 
Signed-off-by: Eric Dumazet 
---
 net/ipv4/tcp_input.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 5ad08d8..c304b5f 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2250,14 +2250,19 @@ static inline void tcp_moderate_cwnd(struct tcp_sock 
*tp)
tp->snd_cwnd_stamp = tcp_time_stamp;
 }
 
+static bool tcp_tsopt_ecr_before(const struct tcp_sock *tp, u32 when)
+{
+   return tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr &&
+  before(tp->rx_opt.rcv_tsecr, when);
+}
+
 /* Nothing was retransmitted or returned timestamp is less
  * than timestamp of the first retransmission.
  */
 static inline bool tcp_packet_delayed(const struct tcp_sock *tp)
 {
return !tp->retrans_stamp ||
-   (tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr &&
-before(tp->rx_opt.rcv_tsecr, tp->retrans_stamp));
+  tcp_tsopt_ecr_before(tp, tp->retrans_stamp);
 }
 
 /* Undo procedures. */
-- 
2.6.0.rc2.230.g3dd15c0

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH net-next 6/7] tcp: track the packet timings in RACK

2015-10-16 Thread Yuchung Cheng
This patch is the first half of the RACK loss recovery.

RACK loss recovery uses the notion of time instead
of packet sequence (FACK) or counts (dupthresh). It's inspired by the
previous FACK heuristic in tcp_mark_lost_retrans(): when a limited
transmit (new data packet) is sacked, then current retransmitted
sequence below the newly sacked sequence must been lost,
since at least one round trip time has elapsed.

But it has several limitations:
1) can't detect tail drops since it depends on limited transmit
2) is disabled upon reordering (assumes no reordering)
3) only enabled in fast recovery ut not timeout recovery

RACK (Recently ACK) addresses these limitations with the notion
of time instead: a packet P1 is lost if a later packet P2 is s/acked,
as at least one round trip has passed.

Since RACK cares about the time sequence instead of the data sequence
of packets, it can detect tail drops when later retransmission is
s/acked while FACK or dupthresh can't. For reordering RACK uses a
dynamically adjusted reordering window ("reo_wnd") to reduce false
positives on ever (small) degree of reordering.

This patch implements tcp_advanced_rack() which tracks the
most recent transmission time among the packets that have been
delivered (ACKed or SACKed) in tp->rack.mstamp. This timestamp
is the key to determine which packet has been lost.

Consider an example that the sender sends six packets:
T1: P1 (lost)
T2: P2
T3: P3
T4: P4
T100: sack of P2. rack.mstamp = T2
T101: retransmit P1
T102: sack of P2,P3,P4. rack.mstamp = T4
T205: ACK of P4 since the hole is repaired. rack.mstamp = T101

We need to be careful about spurious retransmission because it may
falsely advance tp->rack.mstamp by an RTT or an RTO, causing RACK
to falsely mark all packets lost, just like a spurious timeout.

We identify spurious retransmission by the ACK's TS echo value.
If TS option is not applicable but the retransmission is acknowledged
less than min-RTT ago, it is likely to be spurious. We refrain from
using the transmission time of these spurious retransmissions.

The second half is implemented in the next patch that marks packet
lost using RACK timestamp.

Signed-off-by: Yuchung Cheng 
Signed-off-by: Neal Cardwell 
Signed-off-by: Eric Dumazet 
---
 include/linux/tcp.h  |  6 ++
 include/net/tcp.h|  5 +
 net/ipv4/Makefile|  1 +
 net/ipv4/tcp_input.c | 14 ++
 net/ipv4/tcp_minisocks.c |  2 ++
 net/ipv4/tcp_recovery.c  | 32 
 6 files changed, 60 insertions(+)
 create mode 100644 net/ipv4/tcp_recovery.c

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 8c54863..5dce970 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -194,6 +194,12 @@ struct tcp_sock {
u32 window_clamp;   /* Maximal window to advertise  */
u32 rcv_ssthresh;   /* Current window clamp */
 
+   /* Information of the most recently (s)acked skb */
+   struct tcp_rack {
+   struct skb_mstamp mstamp; /* (Re)sent time of the skb */
+   u8 advanced; /* mstamp advanced since last lost marking */
+   u8 reord;/* reordering detected */
+   } rack;
u16 advmss; /* Advertised MSS   */
u8  unused;
u8  nonagle : 4,/* Disable Nagle algorithm? */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 4575f0e..aee5f23 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1750,6 +1750,11 @@ int tcpv4_offload_init(void);
 void tcp_v4_init(void);
 void tcp_init(void);
 
+/* tcp_recovery.c */
+
+extern void tcp_rack_advance(struct tcp_sock *tp,
+const struct skb_mstamp *xmit_time, u8 sacked);
+
 /*
  * Save and compile IPv4 options, return a pointer to it
  */
diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
index 89aacb6..c29809f 100644
--- a/net/ipv4/Makefile
+++ b/net/ipv4/Makefile
@@ -8,6 +8,7 @@ obj-y := route.o inetpeer.o protocol.o \
 inet_timewait_sock.o inet_connection_sock.o \
 tcp.o tcp_input.o tcp_output.o tcp_timer.o tcp_ipv4.o \
 tcp_minisocks.o tcp_cong.o tcp_metrics.o tcp_fastopen.o \
+tcp_recovery.o \
 tcp_offload.o datagram.o raw.o udp.o udplite.o \
 udp_offload.o arp.o icmp.o devinet.o af_inet.o igmp.o \
 fib_frontend.o fib_semantics.o fib_trie.o \
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index c304b5f..21a9ea4 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1173,6 +1173,8 @@ static u8 tcp_sacktag_one(struct sock *sk,
return sacked;
 
if (!(sacked & TCPCB_SACKED_ACKED)) {
+   tcp_rack_advance(tp, xmit_time, sacked);
+
if (sacked & TCPCB_SACKED_RETRANS) {
/* If the segment is not tagged as lost,
 * we do not clear RETRANS, believing
@@ -2256,6 +2258,16 

[PATCH net-next 2/7] tcp: track min RTT using windowed min-filter

2015-10-16 Thread Yuchung Cheng
Kathleen Nichols' algorithm for tracking the minimum RTT of a
data stream over some measurement window. It uses constant space
and constant time per update. Yet it almost always delivers
the same minimum as an implementation that has to keep all
the data in the window. The measurement window is tunable via
sysctl.net.ipv4.tcp_min_rtt_wlen with a default value of 5 minutes.

The algorithm keeps track of the best, 2nd best & 3rd best min
values, maintaining an invariant that the measurement time of
the n'th best >= n-1'th best. It also makes sure that the three
values are widely separated in the time window since that bounds
the worse case error when that data is monotonically increasing
over the window.

Upon getting a new min, we can forget everything earlier because
it has no value - the new min is less than everything else in the
window by definition and it's the most recent. So we restart fresh
on every new min and overwrites the 2nd & 3rd choices. The same
property holds for the 2nd & 3rd best.

Therefore we have to maintain two invariants to maximize the
information in the samples, one on values (1st.v <= 2nd.v <=
3rd.v) and the other on times (now-win <=1st.t <= 2nd.t <= 3rd.t <=
now). These invariants determine the structure of the code

The RTT input to the windowed filter is the minimum RTT measured
from ACK or SACK, or as the last resort from TCP timestamps.

The accessor tcp_min_rtt() returns the minimum RTT seen in the
window. ~0U indicates it is not available. The minimum is 1usec
even if the true RTT is below that.

Signed-off-by: Yuchung Cheng 
Signed-off-by: Neal Cardwell 
Signed-off-by: Eric Dumazet 
---
 Documentation/networking/ip-sysctl.txt |  8 
 include/linux/tcp.h|  3 ++
 include/net/tcp.h  |  7 +++
 net/ipv4/sysctl_net_ipv4.c |  7 +++
 net/ipv4/tcp.c |  1 +
 net/ipv4/tcp_input.c   | 78 +++---
 net/ipv4/tcp_minisocks.c   |  1 +
 7 files changed, 100 insertions(+), 5 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt 
b/Documentation/networking/ip-sysctl.txt
index ebe94f2..502d6a5 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -384,6 +384,14 @@ tcp_mem - vector of 3 INTEGERs: min, pressure, max
Defaults are calculated at boot time from amount of available
memory.
 
+tcp_min_rtt_wlen - INTEGER
+   The window length of the windowed min filter to track the minimum RTT.
+   A shorter window lets a flow more quickly pick up new (higher)
+   minimum RTT when it is moved to a longer path (e.g., due to traffic
+   engineering). A longer window makes the filter more resistant to RTT
+   inflations such as transient congestion. The unit is seconds.
+   Default: 300
+
 tcp_moderate_rcvbuf - BOOLEAN
If set, TCP performs receive buffer auto-tuning, attempting to
automatically size the buffer (no greater than tcp_rmem[2]) to
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 86a7eda..90edef5 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -217,6 +217,9 @@ struct tcp_sock {
u32 mdev_max_us;/* maximal mdev for the last rtt period */
u32 rttvar_us;  /* smoothed mdev_max*/
u32 rtt_seq;/* sequence number to update rttvar */
+   struct rtt_meas {
+   u32 rtt, ts;/* RTT in usec and sampling time in jiffies. */
+   } rtt_min[3];
 
u32 packets_out;/* Packets which are "in flight"*/
u32 retrans_out;/* Retransmitted packets out*/
diff --git a/include/net/tcp.h b/include/net/tcp.h
index a6be56d..4575f0e 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -279,6 +279,7 @@ extern int sysctl_tcp_limit_output_bytes;
 extern int sysctl_tcp_challenge_ack_limit;
 extern unsigned int sysctl_tcp_notsent_lowat;
 extern int sysctl_tcp_min_tso_segs;
+extern int sysctl_tcp_min_rtt_wlen;
 extern int sysctl_tcp_autocorking;
 extern int sysctl_tcp_invalid_ratelimit;
 extern int sysctl_tcp_pacing_ss_ratio;
@@ -671,6 +672,12 @@ static inline bool tcp_ca_dst_locked(const struct 
dst_entry *dst)
return dst_metric_locked(dst, RTAX_CC_ALGO);
 }
 
+/* Minimum RTT in usec. ~0 means not available. */
+static inline u32 tcp_min_rtt(const struct tcp_sock *tp)
+{
+   return tp->rtt_min[0].rtt;
+}
+
 /* Compute the actual receive window we are currently advertising.
  * Rcv_nxt can be after the window if our peer push more data
  * than the offered window.
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 894da3a..13ab434 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -577,6 +577,13 @@ static struct ctl_table ipv4_table[] = {
.proc_handler   = proc_dointvec
},
{
+   .procname   = "tcp_min_rtt_

[PATCH net-next 7/7] tcp: use RACK to detect losses

2015-10-16 Thread Yuchung Cheng
This patch implements the second half of RACK that uses the the most
recent transmit time among all delivered packets to detect losses.

tcp_rack_mark_lost() is called upon receiving a dubious ACK.
It then checks if an not-yet-sacked packet was sent at least
"reo_wnd" prior to the sent time of the most recently delivered.
If so the packet is deemed lost.

The "reo_wnd" reordering window starts with 1msec for fast loss
detection and changes to min-RTT/4 when reordering is observed.
We found 1msec accommodates well on tiny degree of reordering
(<3 pkts) on faster links. We use min-RTT instead of SRTT because
reordering is more of a path property but SRTT can be inflated by
self-inflicated congestion. The factor of 4 is borrowed from the
delayed early retransmit and seems to work reasonably well.

Since RACK is still experimental, it is now used as a supplemental
loss detection on top of existing algorithms. It is only effective
after the fast recovery starts or after the timeout occurs. The
fast recovery is still triggered by FACK and/or dupack threshold
instead of RACK.

We introduce a new sysctl net.ipv4.tcp_recovery for future
experiments of loss recoveries. For now RACK can be disabled by
setting it to 0.

Signed-off-by: Yuchung Cheng 
Signed-off-by: Neal Cardwell 
Signed-off-by: Eric Dumazet 
---
 Documentation/networking/ip-sysctl.txt |  9 
 include/net/tcp.h  |  9 
 net/ipv4/sysctl_net_ipv4.c |  7 
 net/ipv4/tcp_input.c   |  9 +++-
 net/ipv4/tcp_recovery.c| 77 ++
 5 files changed, 109 insertions(+), 2 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt 
b/Documentation/networking/ip-sysctl.txt
index 502d6a5..85752c8 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -433,6 +433,15 @@ tcp_orphan_retries - INTEGER
you should think about lowering this value, such sockets
may consume significant resources. Cf. tcp_max_orphans.
 
+tcp_recovery - INTEGER
+   This value is a bitmap to enable various experimental loss recovery
+   features.
+
+   RACK: 0x1 enables the RACK loss detection for fast detection of lost
+ retransmissions and tail drops.
+
+   Default: 0x1
+
 tcp_reordering - INTEGER
Initial reordering level of packets in a TCP stream.
TCP stack can then dynamically adjust flow reordering level
diff --git a/include/net/tcp.h b/include/net/tcp.h
index aee5f23..f615789 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -567,6 +567,7 @@ void tcp_resume_early_retransmit(struct sock *sk);
 void tcp_rearm_rto(struct sock *sk);
 void tcp_synack_rtt_meas(struct sock *sk, struct request_sock *req);
 void tcp_reset(struct sock *sk);
+void tcp_skb_mark_lost_uncond_verify(struct tcp_sock *tp, struct sk_buff *skb);
 
 /* tcp_timer.c */
 void tcp_init_xmit_timers(struct sock *);
@@ -1752,6 +1753,14 @@ void tcp_init(void);
 
 /* tcp_recovery.c */
 
+/* Flags to enable various loss recovery features. See below */
+extern int sysctl_tcp_recovery;
+
+/* Use TCP RACK to detect (some) tail and retransmit losses */
+#define TCP_RACK_LOST_RETRANS  0x1
+
+extern int tcp_rack_mark_lost(struct sock *sk);
+
 extern void tcp_rack_advance(struct tcp_sock *tp,
 const struct skb_mstamp *xmit_time, u8 sacked);
 
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 13ab434..25300c5 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -496,6 +496,13 @@ static struct ctl_table ipv4_table[] = {
.proc_handler   = proc_dointvec
},
{
+   .procname   = "tcp_recovery",
+   .data   = &sysctl_tcp_recovery,
+   .maxlen = sizeof(int),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec,
+   },
+   {
.procname   = "tcp_reordering",
.data   = &sysctl_tcp_reordering,
.maxlen = sizeof(int),
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 21a9ea4..cf0f954 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -881,6 +881,7 @@ static void tcp_update_reordering(struct sock *sk, const 
int metric,
 
if (metric > 0)
tcp_disable_early_retrans(tp);
+   tp->rack.reord = 1;
 }
 
 /* This must be called before lost_out is incremented */
@@ -906,8 +907,7 @@ static void tcp_skb_mark_lost(struct tcp_sock *tp, struct 
sk_buff *skb)
}
 }
 
-static void tcp_skb_mark_lost_uncond_verify(struct tcp_sock *tp,
-   struct sk_buff *skb)
+void tcp_skb_mark_lost_uncond_verify(struct tcp_sock *tp, struct sk_buff *skb)
 {
tcp_verify_retransmit_hint(tp, skb);
 
@@ -2806,6 +2806,11 @@ static void tcp_fastretrans_alert(struct sock *sk, const 
int acked,
   

[PATCH net-next 0/7] RACK loss detection

2015-10-16 Thread Yuchung Cheng
RACK (Recent ACK) loss recovery uses the notion of time instead of
packet sequence (FACK) or counts (dupthresh).

It's inspired by the FACK heuristic in tcp_mark_lost_retrans(): when a
limited transmit (new data packet) is sacked in recovery, then any
retransmission sent before that newly sacked packet was sent must have
been lost, since at least one round trip time has elapsed.

But that existing heuristic from tcp_mark_lost_retrans()
has several limitations:
  1) it can't detect tail drops since it depends on limited transmit
  2) it's disabled upon reordering (assumes no reordering)
  3) it's only enabled in fast recovery but not timeout recovery

RACK addresses these limitations with a core idea: an unacknowledged
packet P1 is deemed lost if a packet P2 that was sent later is is
s/acked, since at least one round trip has passed.

Since RACK cares about the time sequence instead of the data sequence
of packets, it can detect tail drops when a later retransmission is
s/acked, while FACK or dupthresh can't. For reordering RACK uses a
dynamically adjusted reordering window ("reo_wnd") to reduce false
positives on ever (small) degree of reordering, similar to the delayed
Early Retransmit.

In the current patch set RACK is only a supplemental loss detection
and does not trigger fast recovery. However we are developing RACK
to replace or consolidate FACK/dupthresh, early retransmit, and
thin-dupack. These heuristics all implicitly bear the time notion.
For example, the delayed Early Retransmit is simply applying RACK
to trigger the fast recovery with small inflight.

RACK requires measuring the minimum RTT. Tracking a global min is less
robust due to traffic engineering pathing changes. Therefore it uses a
windowed filter by Kathleen Nichols. The min RTT can also be useful
for various other purposes like congestion control or stat monitoring.

This patch has been used on Google servers for well over 1 year. RACK
has also been implemented in the QUIC protocol. We are submitting an
IETF draft as well.

Yuchung Cheng (7):
  tcp: apply Kern's check on RTTs used for congestion control
  tcp: track min RTT using windowed min-filter
  tcp: remove tcp_mark_lost_retrans()
  tcp: add tcp_tsopt_ecr_before helper
  tcp: skb_mstamp_after helper
  tcp: track the packet timings in RACK
  tcp: use RACK to detect losses

 Documentation/networking/ip-sysctl.txt |  17 
 include/linux/skbuff.h |   9 ++
 include/linux/tcp.h|  11 +-
 include/net/tcp.h  |  21 
 net/ipv4/Makefile  |   1 +
 net/ipv4/sysctl_net_ipv4.c |  14 +++
 net/ipv4/tcp.c |   1 +
 net/ipv4/tcp_input.c   | 180 +++--
 net/ipv4/tcp_minisocks.c   |   3 +
 net/ipv4/tcp_output.c  |   6 --
 net/ipv4/tcp_recovery.c| 109 
 11 files changed, 286 insertions(+), 86 deletions(-)
 create mode 100644 net/ipv4/tcp_recovery.c

-- 
2.6.0.rc2.230.g3dd15c0

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH net-next 3/7] tcp: remove tcp_mark_lost_retrans()

2015-10-16 Thread Yuchung Cheng
Remove the existing lost retransmit detection because RACK subsumes
it completely. This also stops the overloading the ack_seq field of
the skb control block.

Signed-off-by: Yuchung Cheng 
Signed-off-by: Neal Cardwell 
Signed-off-by: Eric Dumazet 
---
 include/linux/tcp.h   |  2 --
 net/ipv4/tcp_input.c  | 65 ---
 net/ipv4/tcp_output.c |  6 -
 3 files changed, 73 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 90edef5..8c54863 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -283,8 +283,6 @@ struct tcp_sock {
int lost_cnt_hint;
u32 retransmit_high;/* L-bits may be on up to this seqno */
 
-   u32 lost_retrans_low;   /* Sent seq after any rxmit (lowest) */
-
u32 prior_ssthresh; /* ssthresh saved at recovery start */
u32 high_seq;   /* snd_nxt at onset of congestion   */
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index e177386..5ad08d8 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1048,70 +1048,6 @@ static bool tcp_is_sackblock_valid(struct tcp_sock *tp, 
bool is_dsack,
return !before(start_seq, end_seq - tp->max_window);
 }
 
-/* Check for lost retransmit. This superb idea is borrowed from "ratehalving".
- * Event "B". Later note: FACK people cheated me again 8), we have to account
- * for reordering! Ugly, but should help.
- *
- * Search retransmitted skbs from write_queue that were sent when snd_nxt was
- * less than what is now known to be received by the other end (derived from
- * highest SACK block). Also calculate the lowest snd_nxt among the remaining
- * retransmitted skbs to avoid some costly processing per ACKs.
- */
-static void tcp_mark_lost_retrans(struct sock *sk, int *flag)
-{
-   const struct inet_connection_sock *icsk = inet_csk(sk);
-   struct tcp_sock *tp = tcp_sk(sk);
-   struct sk_buff *skb;
-   int cnt = 0;
-   u32 new_low_seq = tp->snd_nxt;
-   u32 received_upto = tcp_highest_sack_seq(tp);
-
-   if (!tcp_is_fack(tp) || !tp->retrans_out ||
-   !after(received_upto, tp->lost_retrans_low) ||
-   icsk->icsk_ca_state != TCP_CA_Recovery)
-   return;
-
-   tcp_for_write_queue(skb, sk) {
-   u32 ack_seq = TCP_SKB_CB(skb)->ack_seq;
-
-   if (skb == tcp_send_head(sk))
-   break;
-   if (cnt == tp->retrans_out)
-   break;
-   if (!after(TCP_SKB_CB(skb)->end_seq, tp->snd_una))
-   continue;
-
-   if (!(TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_RETRANS))
-   continue;
-
-   /* TODO: We would like to get rid of tcp_is_fack(tp) only
-* constraint here (see above) but figuring out that at
-* least tp->reordering SACK blocks reside between ack_seq
-* and received_upto is not easy task to do cheaply with
-* the available datastructures.
-*
-* Whether FACK should check here for tp->reordering segs
-* in-between one could argue for either way (it would be
-* rather simple to implement as we could count fack_count
-* during the walk and do tp->fackets_out - fack_count).
-*/
-   if (after(received_upto, ack_seq)) {
-   TCP_SKB_CB(skb)->sacked &= ~TCPCB_SACKED_RETRANS;
-   tp->retrans_out -= tcp_skb_pcount(skb);
-   *flag |= FLAG_LOST_RETRANS;
-   tcp_skb_mark_lost_uncond_verify(tp, skb);
-   NET_INC_STATS_BH(sock_net(sk), 
LINUX_MIB_TCPLOSTRETRANSMIT);
-   } else {
-   if (before(ack_seq, new_low_seq))
-   new_low_seq = ack_seq;
-   cnt += tcp_skb_pcount(skb);
-   }
-   }
-
-   if (tp->retrans_out)
-   tp->lost_retrans_low = new_low_seq;
-}
-
 static bool tcp_check_dsack(struct sock *sk, const struct sk_buff *ack_skb,
struct tcp_sack_block_wire *sp, int num_sacks,
u32 prior_snd_una)
@@ -1838,7 +1774,6 @@ advance_sp:
((inet_csk(sk)->icsk_ca_state != TCP_CA_Loss) || tp->undo_marker))
tcp_update_reordering(sk, tp->fackets_out - state->reord, 0);
 
-   tcp_mark_lost_retrans(sk, &state->flag);
tcp_verify_left_out(tp);
 out:
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 6e79fcb..49730e5 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2655,8 +2655,6 @@ int tcp_retransmit_skb(struct sock *sk, struct sk_buff 
*skb)
net_dbg_ratelimited("retrans_out leaked\n");
}
 #endif
-   if (!tp->retrans_out)
-   tp->lost_retrans_l

[PATCH net-next 1/7] tcp: apply Kern's check on RTTs used for congestion control

2015-10-16 Thread Yuchung Cheng
Currently ca_seq_rtt_us does not use Kern's check. Fix that by
checking if any packet acked is a retransmit, for both RTT used
for RTT estimation and congestion control.

Fixes: 5b08e47ca ("tcp: prefer packet timing to TS-ECR for RTT")
Signed-off-by: Yuchung Cheng 
Signed-off-by: Neal Cardwell 
Signed-off-by: Eric Dumazet 
---
 net/ipv4/tcp_input.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 3b35c3f..38743e5 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2925,9 +2925,6 @@ static inline bool tcp_ack_update_rtt(struct sock *sk, 
const int flag,
 * Karn's algorithm forbids taking RTT if some retransmitted data
 * is acked (RFC6298).
 */
-   if (flag & FLAG_RETRANS_DATA_ACKED)
-   seq_rtt_us = -1L;
-
if (seq_rtt_us < 0)
seq_rtt_us = sack_rtt_us;
 
@@ -3169,7 +3166,7 @@ static int tcp_clean_rtx_queue(struct sock *sk, int 
prior_fackets,
flag |= FLAG_SACK_RENEGING;
 
skb_mstamp_get(&now);
-   if (likely(first_ackt.v64)) {
+   if (likely(first_ackt.v64) && !(flag & FLAG_RETRANS_DATA_ACKED)) {
seq_rtt_us = skb_mstamp_us_delta(&now, &first_ackt);
ca_rtt_us = skb_mstamp_us_delta(&now, &last_ackt);
}
-- 
2.6.0.rc2.230.g3dd15c0

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 1/2] drivers: net: cpsw: add phy-handle parsing

2015-10-16 Thread Heiko Schocher
add the ability to parse "phy-handle". This
is needed for phys, which have a DT node, and
need to parse DT properties.

Signed-off-by: Heiko Schocher 
---

Changes in v2: None

 Documentation/devicetree/bindings/net/cpsw.txt |  1 +
 drivers/net/ethernet/ti/cpsw.c | 15 +++
 2 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/Documentation/devicetree/bindings/net/cpsw.txt 
b/Documentation/devicetree/bindings/net/cpsw.txt
index a9df21a..a2cae4e 100644
--- a/Documentation/devicetree/bindings/net/cpsw.txt
+++ b/Documentation/devicetree/bindings/net/cpsw.txt
@@ -39,6 +39,7 @@ Required properties:
 Optional properties:
 - dual_emac_res_vlan   : Specifies VID to be used to segregate the ports
 - mac-address  : See ethernet.txt file in the same directory
+- phy-handle   : See ethernet.txt file in the same directory
 
 Note: "ti,hwmods" field is used to fetch the base address and irq
 resources from TI, omap hwmod data base during device registration.
diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index 8fc90f1..874fb29 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -30,6 +30,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -365,6 +366,7 @@ struct cpsw_priv {
spinlock_t  lock;
struct platform_device  *pdev;
struct net_device   *ndev;
+   struct device_node  *phy_node;
struct napi_struct  napi_rx;
struct napi_struct  napi_tx;
struct device   *dev;
@@ -1145,7 +1147,11 @@ static void cpsw_slave_open(struct cpsw_slave *slave, 
struct cpsw_priv *priv)
cpsw_ale_add_mcast(priv->ale, priv->ndev->broadcast,
   1 << slave_port, 0, 0, ALE_MCAST_FWD_2);
 
-   slave->phy = phy_connect(priv->ndev, slave->data->phy_id,
+   if (priv->phy_node)
+   slave->phy = of_phy_connect(priv->ndev, priv->phy_node,
+&cpsw_adjust_link, 0, slave->data->phy_if);
+   else
+   slave->phy = phy_connect(priv->ndev, slave->data->phy_id,
 &cpsw_adjust_link, slave->data->phy_if);
if (IS_ERR(slave->phy)) {
dev_err(priv->dev, "phy %s not found on slave %d\n",
@@ -1934,11 +1940,12 @@ static void cpsw_slave_init(struct cpsw_slave *slave, 
struct cpsw_priv *priv,
slave->port_vlan = data->dual_emac_res_vlan;
 }
 
-static int cpsw_probe_dt(struct cpsw_platform_data *data,
+static int cpsw_probe_dt(struct cpsw_priv *priv,
 struct platform_device *pdev)
 {
struct device_node *node = pdev->dev.of_node;
struct device_node *slave_node;
+   struct cpsw_platform_data *data = &priv->data;
int i = 0, ret;
u32 prop;
 
@@ -2029,6 +2036,7 @@ static int cpsw_probe_dt(struct cpsw_platform_data *data,
if (strcmp(slave_node->name, "slave"))
continue;
 
+   priv->phy_node = of_parse_phandle(slave_node, "phy-handle", 0);
parp = of_get_property(slave_node, "phy_id", &lenp);
if ((parp == NULL) || (lenp != (sizeof(void *) * 2))) {
dev_err(&pdev->dev, "Missing slave[%d] phy_id 
property\n", i);
@@ -2044,7 +2052,6 @@ static int cpsw_probe_dt(struct cpsw_platform_data *data,
}
snprintf(slave_data->phy_id, sizeof(slave_data->phy_id),
 PHY_ID_FMT, mdio->name, phyid);
-
slave_data->phy_if = of_get_phy_mode(slave_node);
if (slave_data->phy_if < 0) {
dev_err(&pdev->dev, "Missing or malformed slave[%d] 
phy-mode property\n",
@@ -2240,7 +2247,7 @@ static int cpsw_probe(struct platform_device *pdev)
/* Select default pin state */
pinctrl_pm_select_default_state(&pdev->dev);
 
-   if (cpsw_probe_dt(&priv->data, pdev)) {
+   if (cpsw_probe_dt(priv, pdev)) {
dev_err(&pdev->dev, "cpsw: platform data missing\n");
ret = -ENODEV;
goto clean_runtime_disable_ret;
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 2/2] net: phy: smsc: disable energy detect mode

2015-10-16 Thread Heiko Schocher
On some boards the energy enable detect mode leads in
trouble with some switches, so make the enabling of
this mode configurable through DT.

Signed-off-by: Heiko Schocher 
---

Changes in v2:
- add comments from Florian Fainelli
  - I did not change disable property name into enable
because I fear to break existing behaviour
  - add smsc vendor prefix
  - remove CONFIG_OF and use __maybe_unused
  - introduce "phy-handle" ability into ti,cpsw
driver, so I can remove bogus:
  if (!of_node && dev->parent->of_node)
  of_node = dev->parent->of_node;
construct. Therefore new patch for the ti,cpsw
driver is necessary.

 .../devicetree/bindings/net/smsc-lan87xx.txt   | 24 ++
 drivers/net/phy/smsc.c | 19 -
 2 files changed, 38 insertions(+), 5 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/net/smsc-lan87xx.txt

diff --git a/Documentation/devicetree/bindings/net/smsc-lan87xx.txt 
b/Documentation/devicetree/bindings/net/smsc-lan87xx.txt
new file mode 100644
index 000..974edd5
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/smsc-lan87xx.txt
@@ -0,0 +1,24 @@
+SMSC LAN87xx Ethernet PHY
+
+Some boards require special tuning values. Configure them
+through an Ethernet OF device node.
+
+Optional properties:
+
+- smsc,disable-energy-detect:
+  If set, do not enable energy detect mode for the SMSC phy.
+  default: enable energy detect mode
+
+Examples:
+smsc phy with disabled energy detect mode on an am335x based board.
+&davinci_mdio {
+   pinctrl-names = "default", "sleep";
+   pinctrl-0 = <&davinci_mdio_default>;
+   pinctrl-1 = <&davinci_mdio_sleep>;
+   status = "okay";
+
+   ethernetphy0: ethernet-phy@0 {
+   reg = <0>;
+   smsc,disable-energy-detect;
+   };
+};
diff --git a/drivers/net/phy/smsc.c b/drivers/net/phy/smsc.c
index 70b0895..dc2da87 100644
--- a/drivers/net/phy/smsc.c
+++ b/drivers/net/phy/smsc.c
@@ -43,16 +43,25 @@ static int smsc_phy_ack_interrupt(struct phy_device *phydev)
 
 static int smsc_phy_config_init(struct phy_device *phydev)
 {
+   int __maybe_unused len;
+   struct device *dev __maybe_unused = &phydev->dev;
+   struct device_node *of_node __maybe_unused = dev->of_node;
int rc = phy_read(phydev, MII_LAN83C185_CTRL_STATUS);
+   int enable_energy = 1;
 
if (rc < 0)
return rc;
 
-   /* Enable energy detect mode for this SMSC Transceivers */
-   rc = phy_write(phydev, MII_LAN83C185_CTRL_STATUS,
-  rc | MII_LAN83C185_EDPWRDOWN);
-   if (rc < 0)
-   return rc;
+   if (of_find_property(of_node, "smsc,disable-energy-detect", &len))
+   enable_energy = 0;
+
+   if (enable_energy) {
+   /* Enable energy detect mode for this SMSC Transceivers */
+   rc = phy_write(phydev, MII_LAN83C185_CTRL_STATUS,
+  rc | MII_LAN83C185_EDPWRDOWN);
+   if (rc < 0)
+   return rc;
+   }
 
return smsc_phy_ack_interrupt(phydev);
 }
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 0/2] net, phy, smsc: add posibility to disable energy detect mode

2015-10-16 Thread Heiko Schocher
On some boards the energy enable detect mode leads in
trouble with some switches, so make the enabling of
this mode configurable through DT.
Therefore the property "smsc,disable-energy-detect" is
introduced.

Patch 1 introduces phy-handle support for the ti,cpsw
driver. This is needed now for the smsc phy.

Patch 2 adds the disable energy mode functionality
to the smsc phy

Changes in v2:
- add comments from Florian Fainelli
  - I did not change disable property name into enable
because I fear to break existing behaviour
  - add smsc vendor prefix
  - remove CONFIG_OF and use __maybe_unused
  - introduce "phy-handle" ability into ti,cpsw
driver, so I can remove bogus:
  if (!of_node && dev->parent->of_node)
  of_node = dev->parent->of_node;
construct. Therefore new patch for the ti,cpsw
driver is necessary.

Heiko Schocher (2):
  drivers: net: cpsw: add phy-handle parsing
  net: phy: smsc: disable energy detect mode

 Documentation/devicetree/bindings/net/cpsw.txt |  1 +
 .../devicetree/bindings/net/smsc-lan87xx.txt   | 24 ++
 drivers/net/ethernet/ti/cpsw.c | 15 ++
 drivers/net/phy/smsc.c | 19 -
 4 files changed, 50 insertions(+), 9 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/net/smsc-lan87xx.txt

-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] net: phy: smsc: disable energy detect mode

2015-10-16 Thread Heiko Schocher

Hello Florian,

Am 16.10.2015 um 18:27 schrieb Florian Fainelli:

2015-10-13 21:17 GMT-07:00 Heiko Schocher :

Hello Florian,


Am 13.10.2015 um 21:26 schrieb Florian Fainelli:


On 12/10/15 22:13, Heiko Schocher wrote:


On some boards the energy enable detect mode leads in
trouble with some switches, so make the enabling of
this mode configurable through DT.

Signed-off-by: Heiko Schocher 
---

   .../devicetree/bindings/net/smsc-lan87xx.txt   | 19
+
   drivers/net/phy/smsc.c | 24
+-
   2 files changed, 38 insertions(+), 5 deletions(-)
   create mode 100644
Documentation/devicetree/bindings/net/smsc-lan87xx.txt

diff --git a/Documentation/devicetree/bindings/net/smsc-lan87xx.txt
b/Documentation/devicetree/bindings/net/smsc-lan87xx.txt
new file mode 100644
index 000..39aa1dc
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/smsc-lan87xx.txt
@@ -0,0 +1,19 @@
+SMSC LAN87xx Ethernet PHY
+
+Some boards require special tuning values. Configure them
+through an Ethernet OF device node.
+
+Optional properties:
+
+- disable-energy-detect:
+  If set, do not enable energy detect mode for the SMSC phy.
+  default: enable energy detect mode



Although energy detection is something that is implemented by many PHYs,
I am not sure a generic property is suitable here, I would prefix that
with the SMSC vendor prefix here to make it clear this only applies to
this PHY.



Hmm... but all PHYs should be able to enable, disable it in some way, or?


It may not always be controlled directly at the PHY level, sometimes
this is something that needs cooperation with the Ethernet MAC as well
in case of integrated designs.


Ah, ok!


Would not you want to make it a reverse property here though, something
like this:

smsc,energy-detect: boolean, when present indicates the PHY reliably
supports energy detection



Yes, that was also my first thought, but currently, on this PHYs
energy detect mode is on ... and if I introduce such a property,
it will disable it for all existing boards, because property is
missing ... so, maybe I break boards ...


Fair enough, how about smsc,disabled-energy-detect or something like that then?


Yes, changed it to "smsc,disable-energy-detect"

bye,
Heiko
--
DENX Software Engineering GmbH,  Managing Director: Wolfgang Denk
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs

2015-10-16 Thread Alexei Starovoitov

On 10/16/15 4:44 PM, Eric W. Biederman wrote:

Alexei Starovoitov  writes:


We can argue about api for 2nd, whether it's mount with fd=1234 string
or else, but for the first mount style doesn't make sense.


Why does mount not make sense?  It is exactly what you are looking for
so why does it not make sense?


hmm, how do you get a new fd back after mounting it?
Note, open cannot be overloaded, so we end up with BPF_NEW_FD anyway,
but now it's more convoluted and empty mounts are everywhere.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 net-next] RDS: fix rds-ping deadlock over TCP transport

2015-10-16 Thread Santosh Shilimkar
Sowmini found hang with rds-ping while testing RDS over TCP. Its
a corner case and doesn't happen always. The issue is not reproducible
with IB transport. Its clear from below dump why we see it with RDS TCP.

 [] do_tcp_setsockopt+0xb5/0x740
 [] tcp_setsockopt+0x24/0x30
 [] sock_common_setsockopt+0x14/0x20
 [] rds_tcp_xmit_prepare+0x5d/0x70 [rds_tcp]
 [] rds_send_xmit+0xd7/0x740 [rds]
 [] rds_send_pong+0x142/0x180 [rds]
 [] rds_recv_incoming+0x274/0x330 [rds]
 [] ? ttwu_queue+0x11e/0x130
 [] ? skb_copy_bits+0x6d/0x2c0
 [] rds_tcp_data_recv+0x2f0/0x3d0 [rds_tcp]
 [] tcp_read_sock+0x96/0x1c0
 [] ? rds_tcp_recv_init+0x40/0x40 [rds_tcp]
 [] ? sock_def_write_space+0xa0/0xa0
 [] rds_tcp_data_ready+0xa1/0xf0 [rds_tcp]
 [] tcp_data_queue+0x379/0x5b0
 [] ? rds_tcp_write_space+0xbb/0x110 [rds_tcp]
 [] tcp_rcv_established+0x2e2/0x6e0
 [] tcp_v4_do_rcv+0x122/0x220
 [] tcp_v4_rcv+0x867/0x880
 [] ip_local_deliver_finish+0xa3/0x220

This happens because rds_send_xmit() chain wants to take
sock_lock which is already taken by tcp_v4_rcv() on its
way to rds_tcp_data_ready(). Commit db6526dcb51b ("RDS: use
rds_send_xmit() state instead of RDS_LL_SEND_FULL") which
was trying to opportunistically finish the send request
in same thread context.

But because of above recursive lock hang with RDS TCP,
the send work from rds_send_pong() needs to deferred to
worker to avoid lock up. Given RDS ping is more of connectivity
test than performance critical path, its should be ok even
for transport like IB.

Reported-by: Sowmini Varadhan 
Acked-by:  Sowmini Varadhan 
Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
v2: Dropped the confusing SEND_LL_FULL check from v1

 net/rds/send.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/net/rds/send.c b/net/rds/send.c
index ee49c25..827155c 100644
--- a/net/rds/send.c
+++ b/net/rds/send.c
@@ -1182,9 +1182,8 @@ rds_send_pong(struct rds_connection *conn, __be16 dport)
rds_stats_inc(s_send_queued);
rds_stats_inc(s_send_pong);
 
-   ret = rds_send_xmit(conn);
-   if (ret == -ENOMEM || ret == -EAGAIN)
-   queue_delayed_work(rds_wq, &conn->c_send_w, 1);
+   /* schedule the send work on rds_wq */
+   queue_delayed_work(rds_wq, &conn->c_send_w, 1);
 
rds_message_put(rm);
return 0;
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch net-next v5 3/8] switchdev: allow caller to explicitly request attr_set as deferred

2015-10-16 Thread John Fastabend
On 15-10-16 01:23 AM, Jiri Pirko wrote:
> Thu, Oct 15, 2015 at 05:21:22PM CEST, john.fastab...@gmail.com wrote:
>> On 15-10-14 10:40 AM, Jiri Pirko wrote:
>>> From: Jiri Pirko 
>>>
>>> Caller should know if he can call attr_set directly (when holding RTNL)
>>> or if he has to defer the att_set processing for later.
>>>
>>> This also allows drivers to sleep inside attr_set and report operation
>>> status back to switchdev core. Switchdev core then warns if status is
>>> not ok, instead of silent errors happening in drivers.
>>>
>>> Benefit from newly introduced switchdev deferred ops infrastructure.
>>>
>>
>> A nit but the patch description should note your setting the defer bit
>> on the bridge set state.
>>
>>> Signed-off-by: Jiri Pirko 
>>> ---
>>>  include/net/switchdev.h   |   1 +
>>>  net/bridge/br_stp.c   |   3 +-
>>>  net/switchdev/switchdev.c | 108 
>>> ++
>>>  3 files changed, 46 insertions(+), 66 deletions(-)
>>>
>>> diff --git a/include/net/switchdev.h b/include/net/switchdev.h
>>> index d1c7f90..f7de6f8 100644
>>> --- a/include/net/switchdev.h
>>> +++ b/include/net/switchdev.h
>>> @@ -17,6 +17,7 @@
>>>  
>>>  #define SWITCHDEV_F_NO_RECURSE BIT(0)
>>>  #define SWITCHDEV_F_SKIP_EOPNOTSUPPBIT(1)
>>> +#define SWITCHDEV_F_DEFER  BIT(2)
>>>  
>>>  struct switchdev_trans_item {
>>> struct list_head list;
>>> diff --git a/net/bridge/br_stp.c b/net/bridge/br_stp.c
>>> index db6d243de..80c34d7 100644
>>> --- a/net/bridge/br_stp.c
>>> +++ b/net/bridge/br_stp.c
>>> @@ -41,13 +41,14 @@ void br_set_state(struct net_bridge_port *p, unsigned 
>>> int state)
>>>  {
>>> struct switchdev_attr attr = {
>>> .id = SWITCHDEV_ATTR_ID_PORT_STP_STATE,
>>> +   .flags = SWITCHDEV_F_DEFER,
>>> .u.stp_state = state,
>>> };
>>
>>
>> This creates a possible race (with 6/8) I think, please check!
> 
> Wait. This patch does not change the previous behaviour. Patch 6 does,
> so I don't understand why you are asking here. Confusing.
> 

Sorry if its confusing I keyed of the addition of the SWITCHDEV_F_DEFER
here.

> 
>>
>> In del_nbp() we call br_stp_disable_port() to set the port state
>> to BR_STATE_DISABLE and disabling learning events. But with this
>> patch it can be deferred. Also note the STP agent may be in userspace
>> which actually seems more likely the case because you likely want to
>> run some more modern variant of STP than the kernel supports.
>>
>> So at some point in the future the driver will turn off learning. At
>> the same time we call br_fdb_delete_by_port which calls a deferred
>> set of fdb deletes.
>>
>> I don't see how you guarantee learning is off before you start doing
>> the deletes here and possibly learning new addresses after the software
>> side believes the port is down.
>>
>> So
>>
>>   br_stp_disable_port
>>   br_fdb_delete_by_port
>>   {fdb_del_external_learn}
>>   [hw learns a fdb]
>>   [hw disables learning]
>>
>> What stops this from happening?
> 
> Okay. This behaviour is the same as without the patchset. What would
> resolve the issue it to put switchdev_deferred_process() after
> br_stp_disable_port() and before br_fdb_delete_by_port() call.
> That would enforce stp change to happen in hw before fdbs are explicitly
> deleted. Sound good to you?

OK so putting the switchdev_deferred_process() between the disable_port
and the delete_by_port will enforce the stp change to happen in hw
before the fdbs are explicitly deleted. I think this is minimally
required. I don't like scattering these flush_workqueue() calls all
over the place but I don't have any better ideas right now so sounds
good enough.

But now I'm wondering if you can have a deferred fdb add in the rocker
driver (rocker_port_fdb_learn_work) running in parallel with this that
could happen after the delete and add a bogus fdb entry. I think you
also need to have a flush in rocker_port_stp_update() to handle this
case.

Also I agree these issues were not completely caused by your patches.

Thanks,
John
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net-next] RDS: fix rds-ping deadlock over TCP transport

2015-10-16 Thread santosh shilimkar

On 10/16/2015 6:45 PM, Sowmini Varadhan wrote:

On (10/16/15 20:26), Santosh Shilimkar wrote:


diff --git a/net/rds/send.c b/net/rds/send.c
+   if (!test_bit(RDS_LL_SEND_FULL, &conn->c_flags))
+   queue_delayed_work(rds_wq, &conn->c_send_w, 0);


A minor note- it would help to add some comments here explaining
that the pong has already been added to the sendq earlier..
in the case of IB, if RDS_LL_SEND_FULL has been set, it takes some
head-scratching to figure out how the pong gets sent, and a  few
comments could help clarify that.


Right. The check confused you. I will update the patch and drop
the check all together and add a jiffies for the queuing which
should take care of it. Will post v2 with that update.


but other than that, the contents look good to me, thus

Acked-by: Sowmini Varadhan 


Thanks !!

regards,
Santosh
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net-next] RDS: fix rds-ping deadlock over TCP transport

2015-10-16 Thread Sowmini Varadhan
On (10/16/15 20:26), Santosh Shilimkar wrote:
> 
> diff --git a/net/rds/send.c b/net/rds/send.c
> + if (!test_bit(RDS_LL_SEND_FULL, &conn->c_flags))
> + queue_delayed_work(rds_wq, &conn->c_send_w, 0);

A minor note- it would help to add some comments here explaining 
that the pong has already been added to the sendq earlier.. 
in the case of IB, if RDS_LL_SEND_FULL has been set, it takes some
head-scratching to figure out how the pong gets sent, and a  few
comments could help clarify that.

but other than that, the contents look good to me, thus

Acked-by: Sowmini Varadhan 



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH net-next] RDS: fix rds-ping deadlock over TCP transport

2015-10-16 Thread Santosh Shilimkar
Sowmini found hang with rds-ping while testing RDS over TCP. Its
a corner case and doesn't happen always. The issue is not reproducible
with IB transport. Its clear from below dump why we see it with RDS TCP.

 [] do_tcp_setsockopt+0xb5/0x740
 [] tcp_setsockopt+0x24/0x30
 [] sock_common_setsockopt+0x14/0x20
 [] rds_tcp_xmit_prepare+0x5d/0x70 [rds_tcp]
 [] rds_send_xmit+0xd7/0x740 [rds]
 [] rds_send_pong+0x142/0x180 [rds]
 [] rds_recv_incoming+0x274/0x330 [rds]
 [] ? ttwu_queue+0x11e/0x130
 [] ? skb_copy_bits+0x6d/0x2c0
 [] rds_tcp_data_recv+0x2f0/0x3d0 [rds_tcp]
 [] tcp_read_sock+0x96/0x1c0
 [] ? rds_tcp_recv_init+0x40/0x40 [rds_tcp]
 [] ? sock_def_write_space+0xa0/0xa0
 [] rds_tcp_data_ready+0xa1/0xf0 [rds_tcp]
 [] tcp_data_queue+0x379/0x5b0
 [] ? rds_tcp_write_space+0xbb/0x110 [rds_tcp]
 [] tcp_rcv_established+0x2e2/0x6e0
 [] tcp_v4_do_rcv+0x122/0x220
 [] tcp_v4_rcv+0x867/0x880
 [] ip_local_deliver_finish+0xa3/0x220

This happens because rds_send_xmit() chain wants to take
sock_lock which is already taken by tcp_v4_rcv() on its
way to rds_tcp_data_ready(). Commit db6526dcb51b ("RDS: use
rds_send_xmit() state instead of RDS_LL_SEND_FULL") which
was trying to opportunistically finish the send request
in same thread context.

But because of above recursive lock hang with RDS TCP,
the send work from rds_send_pong() needs to deferred to
worker to avoid lock up. Given RDS ping is more of connectivity
test than performance critical path, its should be ok even
for transport like IB.

Reported-by: Sowmini Varadhan 
Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/send.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/net/rds/send.c b/net/rds/send.c
index ee49c25..7a377a1 100644
--- a/net/rds/send.c
+++ b/net/rds/send.c
@@ -1182,9 +1182,8 @@ rds_send_pong(struct rds_connection *conn, __be16 dport)
rds_stats_inc(s_send_queued);
rds_stats_inc(s_send_pong);
 
-   ret = rds_send_xmit(conn);
-   if (ret == -ENOMEM || ret == -EAGAIN)
-   queue_delayed_work(rds_wq, &conn->c_send_w, 1);
+   if (!test_bit(RDS_LL_SEND_FULL, &conn->c_flags))
+   queue_delayed_work(rds_wq, &conn->c_send_w, 0);
 
rds_message_put(rm);
return 0;
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] net/phy: micrel: Follow attached_dev to get to the MAC device

2015-10-16 Thread Andrew Lunn
On Sat, Oct 17, 2015 at 02:32:30AM +0300, Sergei Shtylyov wrote:
> On 10/16/2015 11:49 PM, Andrew Lunn wrote:
> 
> >Commit "8b63ec1837fa phylib: Make PHYs children of their MDIO bus, not
> >the bus' parent." broke finding PHY properties in the MAC device tree
> 
>You probably forgot to run checkpatch.pl on this patch, else it
> would have complained about the commit citing style. It's <12-bit
> SHA1> ("").

Actual, i did, and decided to ignore it. I'm quoting the regression
report, which formats it that way. However i did deliberately use the
correct format for the fixes: line, where it actually matters.

checkpatch is just a guide, not a rigid rule.

   Andrew
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs

2015-10-16 Thread Eric W. Biederman
Alexei Starovoitov  writes:

> We can argue about api for 2nd, whether it's mount with fd=1234 string
> or else, but for the first mount style doesn't make sense.

Why does mount not make sense?  It is exactly what you are looking for
so why does it not make sense?

Eric
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] net/phy: micrel: Follow attached_dev to get to the MAC device

2015-10-16 Thread Andrew Lunn
On Fri, Oct 16, 2015 at 05:58:41PM -0500, Dinh Nguyen wrote:
> On Fri, 16 Oct 2015, Andrew Lunn wrote:
> 
> > Commit "8b63ec1837fa phylib: Make PHYs children of their MDIO bus, not
> > the bus' parent." broke finding PHY properties in the MAC device tree
> > node. The parent device is now the MDIO bus, not the MAC. Use
> > attached_dev towards the MAC device tree node.
> > 
> > Fixes: 8b63ec1837fa ("phylib: Make PHYs children of their MDIO bus, not the 
> > bus' parent.")
> > Signed-off-by: Andrew Lunn 
> > ---
> > 
> > Compile tested only.
> > 
> > Dinh, please could you test it and report back if it works or not.
> >
> 
> This patch did not seem to fix the problem. The following code did seem to
> fix the problem:
> 
>   if (!of_node && dev->parent->of_node)
> -   of_node = dev->parent->of_node;
> +   do {
> +   of_node = dev->of_node;
> +   dev = dev->parent;
> +   i++;
> +   } while (!of_node && dev);

This might fix the issue, but it has disadvantages. As i said before,
it allows people to place phy properties into the mdio device node. We
want to be reducing placing you can add phy properties, not adding
more.

Please could you try to debug why my patch did not work. Is
attached_dev null?

 Andrew
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH net] tunnels: Don't require remote endpoint or ID during creation.

2015-10-16 Thread Jesse Gross
Before lightweight tunnels existed, it really didn't make sense to
create a tunnel that was not fully specified, such as without a
destination IP address - the resulting packets would go nowhere.
However, with lightweight tunnels, the opposite is true - it doesn't
make sense to require this information when it will be provided later
on by the route. This loosens the requirements for this information.

An alternative would be to allow the relaxed version only when
COLLECT_METADATA is enabled. However, since there are several
variations on this theme (such as NBMA tunnels in GRE), just dropping
the restrictions seems the most consistent across tunnels and with
the existing configuration.

CC: John Linville 
Signed-off-by: Jesse Gross 
---
 drivers/net/geneve.c | 12 ++--
 drivers/net/vxlan.c  |  7 +++
 2 files changed, 9 insertions(+), 10 deletions(-)

diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index 8f5c02e..cde29f8 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -870,14 +870,14 @@ static int geneve_newlink(struct net *net, struct 
net_device *dev,
__be16 dst_port = htons(GENEVE_UDP_PORT);
__u8 ttl = 0, tos = 0;
bool metadata = false;
-   __be32 rem_addr;
-   __u32 vni;
+   __be32 rem_addr = 0;
+   __u32 vni = 0;
 
-   if (!data[IFLA_GENEVE_ID] || !data[IFLA_GENEVE_REMOTE])
-   return -EINVAL;
+   if (data[IFLA_GENEVE_ID])
+   vni = nla_get_u32(data[IFLA_GENEVE_ID]);
 
-   vni = nla_get_u32(data[IFLA_GENEVE_ID]);
-   rem_addr = nla_get_in_addr(data[IFLA_GENEVE_REMOTE]);
+   if (data[IFLA_GENEVE_REMOTE])
+   rem_addr = nla_get_in_addr(data[IFLA_GENEVE_REMOTE]);
 
if (data[IFLA_GENEVE_TTL])
ttl = nla_get_u8(data[IFLA_GENEVE_TTL]);
diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index bbac1d3..afdc65f 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -2745,11 +2745,10 @@ static int vxlan_newlink(struct net *src_net, struct 
net_device *dev,
struct vxlan_config conf;
int err;
 
-   if (!data[IFLA_VXLAN_ID])
-   return -EINVAL;
-
memset(&conf, 0, sizeof(conf));
-   conf.vni = nla_get_u32(data[IFLA_VXLAN_ID]);
+
+   if (data[IFLA_VXLAN_ID])
+   conf.vni = nla_get_u32(data[IFLA_VXLAN_ID]);
 
if (data[IFLA_VXLAN_GROUP]) {
conf.remote_ip.sin.sin_addr.s_addr = 
nla_get_in_addr(data[IFLA_VXLAN_GROUP]);
-- 
2.1.4

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] net/phy: micrel: Follow attached_dev to get to the MAC device

2015-10-16 Thread Sergei Shtylyov

On 10/16/2015 11:49 PM, Andrew Lunn wrote:


Commit "8b63ec1837fa phylib: Make PHYs children of their MDIO bus, not
the bus' parent." broke finding PHY properties in the MAC device tree


   You probably forgot to run checkpatch.pl on this patch, else it would have 
complained about the commit citing style. It's <12-bit SHA1> ("").



node. The parent device is now the MDIO bus, not the MAC. Use
attached_dev towards the MAC device tree node.

Fixes: 8b63ec1837fa ("phylib: Make PHYs children of their MDIO bus, not the bus' 
parent.")


   Yeah, exactly like this.


Signed-off-by: Andrew Lunn 

[...]

MBR, Sergei


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] ip-route man: add usage and description for lwtunnel encap attributes

2015-10-16 Thread Stephen Hemminger
On Thu, 15 Oct 2015 13:13:40 +0200
Thomas Graf  wrote:

> From: Roopa Prabhu 
> 
> This patch updates ip-route man page with lwtunnel encap
> usage and description, covering MPLS and IP encapsulation.
> 
> Signed-off-by: Roopa Prabhu 
> Signed-off-by: Thomas Graf 

I applied all 3 patches. The first one into master (since it might be useful for
other code), and other two to net-next.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH net-next] uapi: add mpls_iptunnel.h

2015-10-16 Thread Stephen Hemminger
Add missing rule to export mpls iptunnel header needed by iproute2

Signed-off-by: Stephen Hemminger 

--- a/include/uapi/linux/Kbuild 2015-10-12 08:26:44.745317580 -0700
+++ b/include/uapi/linux/Kbuild 2015-10-16 16:09:39.362933576 -0700
@@ -263,6 +263,7 @@ header-y += minix_fs.h
 header-y += mman.h
 header-y += mmtimer.h
 header-y += mpls.h
+header-y += mpls_iptunnel.h
 header-y += mqueue.h
 header-y += mroute6.h
 header-y += mroute.h
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] net/phy: micrel: Follow attached_dev to get to the MAC device

2015-10-16 Thread Dinh Nguyen
On Fri, 16 Oct 2015, Andrew Lunn wrote:

> Commit "8b63ec1837fa phylib: Make PHYs children of their MDIO bus, not
> the bus' parent." broke finding PHY properties in the MAC device tree
> node. The parent device is now the MDIO bus, not the MAC. Use
> attached_dev towards the MAC device tree node.
> 
> Fixes: 8b63ec1837fa ("phylib: Make PHYs children of their MDIO bus, not the 
> bus' parent.")
> Signed-off-by: Andrew Lunn 
> ---
> 
> Compile tested only.
> 
> Dinh, please could you test it and report back if it works or not.
>

This patch did not seem to fix the problem. The following code did seem to
fix the problem:

if (!of_node && dev->parent->of_node)
-   of_node = dev->parent->of_node;
+   do {
+   of_node = dev->of_node;
+   dev = dev->parent;
+   i++;
+   } while (!of_node && dev);

BR,
Dinh
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 iproute2] ip-address: fix oneline mode for interfaces with VF

2015-10-16 Thread Stephen Hemminger
On Fri, 16 Oct 2015 12:38:33 +0200
Phil Sutter  wrote:

> Signed-off-by: Phil Sutter 
> ---
> Changed since v1: Fix commit message - typo in From: header confused
>   git-send-email.

Ok, applied
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH iproute2] ip monitor neigh: Change 'delete' to 'Deleted' to be consistent with ip route

2015-10-16 Thread Stephen Hemminger
On Thu, 15 Oct 2015 11:47:43 -0700
Roopa Prabhu  wrote:

> From: Roopa Prabhu 
> 
> It helps to grep for one string "Deleted" when monitoring all events.
> 
> Fixes: 6ea3ebafe077 ("iproute2: inform user when a neighbor is removed")
> Signed-off-by: Roopa Prabhu 
> ---
> I am not sure if it is too late for this change. But, sending this patch 
> out because it only affects ip monitor output
> 

Sure applied.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V3 2/2] bpf: control all the perf events stored in PERF_EVENT_ARRAY maps

2015-10-16 Thread Alexei Starovoitov

On 10/16/15 12:42 AM, Kaixu Xia wrote:

This patch implements the function that controlling all the perf
events stored in PERF_EVENT_ARRAY maps by setting the parameter
'index' to maps max_entries.

Signed-off-by: Kaixu Xia 
---
  kernel/trace/bpf_trace.c | 20 ++--
  1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 3175600..4b385863 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -229,13 +229,30 @@ static u64 bpf_perf_event_dump_control(u64 r1, u64 index, 
u64 flag, u64 r4, u64
struct bpf_map *map = (struct bpf_map *) (unsigned long) r1;
struct bpf_array *array = container_of(map, struct bpf_array, map);
struct perf_event *event;
+   int i;

-   if (unlikely(index >= array->map.max_entries))
+   if (unlikely(index > array->map.max_entries))
return -E2BIG;

if (flag & BIT_FLAG_CHECK)
return -EINVAL;

+   if (index == array->map.max_entries) {


I don't like in-band signaling like this, since it's easy to make
mistake on the bpf program side.
Please use 2nd bit of 'flags' for that instead.
Also squash this patch into 1st.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V3 1/2] bpf: control the trace data output on current cpu when perf sampling

2015-10-16 Thread Alexei Starovoitov

On 10/16/15 12:42 AM, Kaixu Xia wrote:

This patch adds the flag dump_enable to control the trace data
output process when perf sampling. By setting this flag and
integrating with ebpf, we can control the data output process and
get the samples we are most interested in.

The bpf helper bpf_perf_event_dump_control() can control the
perf_event on current cpu.

Signed-off-by: Kaixu Xia 
---
  include/linux/perf_event.h  |  1 +
  include/uapi/linux/bpf.h|  5 +
  include/uapi/linux/perf_event.h |  3 ++-
  kernel/bpf/verifier.c   |  3 ++-
  kernel/events/core.c| 13 
  kernel/trace/bpf_trace.c| 44 +
  6 files changed, 67 insertions(+), 2 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 092a0e8..2af527e 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -472,6 +472,7 @@ struct perf_event {
struct irq_work pending;

atomic_tevent_limit;
+   atomic_tdump_enable;


The naming is the hardest...
How about calling it 'soft_enable' instead?


--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -287,6 +287,11 @@ enum bpf_func_id {
 * Return: realm if != 0
 */
BPF_FUNC_get_route_realm,
+
+   /**
+* u64 bpf_perf_event_dump_control(&map, index, flag)
+*/
+   BPF_FUNC_perf_event_dump_control,


and this one is too long.
May be bpf_perf_event_control() ?

Daniel, any thoughts on naming?


--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -331,7 +331,8 @@ struct perf_event_attr {
comm_exec  :  1, /* flag comm events that 
are due to an exec */
use_clockid:  1, /* use @clockid for time 
fields */
context_switch :  1, /* context switch data */
-   __reserved_1   : 37;
+   dump_enable:  1, /* don't output data on 
samples */


either comment or name is wrong.
how about calling this one 'soft_disable',
since you want zero to be default and the event should be on.


diff --git a/kernel/events/core.c b/kernel/events/core.c
index b11756f..74a16af 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6337,6 +6337,9 @@ static int __perf_event_overflow(struct perf_event *event,
irq_work_queue(&event->pending);
}

+   if (!atomic_read(&event->dump_enable))
+   return ret;


I'm not an expert in this piece of perf, but should it be 'return 0'
instead ?
and may be moved to is_sampling_event() check?
Also please add unlikely().


+static void perf_event_check_dump_flag(struct perf_event *event)
+{
+   if (event->attr.dump_enable == 1)
+   atomic_set(&event->dump_enable, 1);
+   else
+   atomic_set(&event->dump_enable, 0);


that looks like it breaks perf, since default for bits is zero
and all events will be soft-disabled?
How did you test it?
Please add a test to samples/bpf/ for this feature.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] net: hix5hd2_gmac: avoid integer overload warning

2015-10-16 Thread Arnd Bergmann
On Friday 16 October 2015 14:22:15 Joe Perches wrote:
> On Fri, 2015-10-16 at 21:50 +0300, Sergei Shtylyov wrote:
> > On 10/16/2015 09:04 PM, Joe Perches wrote:
> > 
> >  BITS_RX_EN is an 'unsigned long' constant, so the ones complement of 
> >  that
> >  has bits set that do not fit into a 32-bit variable on 64-bit 
> >  architectures,
> >  which causes a harmless gcc warning:
> > >>> ...
> >    static void hix5hd2_port_disable(struct hix5hd2_priv *priv)
> >    {
> >  - writel_relaxed(~(BITS_RX_EN | BITS_TX_EN), priv->base + PORT_EN);
> >  + writel_relaxed(~(u32)(BITS_RX_EN | BITS_TX_EN), priv->base + 
> >  PORT_EN);
> > writel_relaxed(0, priv->base + DESC_WR_RD_ENA);
> > >>>
> > >>> ISTM that just means that the constants shouldn't be 'long'.
> > >>
> > >> Right, but that would probably mean changing the BIT() macro or not 
> > >> using it
> > >> here. In the past I've argued against using that macro, but I've given
> > >> up that fight.
> > >
> > > Fight on... (Somebody must have gone to USC here)

Ok, I'll try:

Please stop this nonsense!

;-)

> > > There might be value in aefin BIT_U32 macro.
> > > Maybe BIT_U64 too.
> > 
> > There's BIT_ULL() already.
> 
> I know, but symmetry is good.
> I think there'd be no harm in adding it.
> Perhaps adding all the sized variants would be useful.
> 
> Something like:
> 
> #define BIT_OF_TYPE(type, nr) \
> ({\
>   typeof(type) rtn;   \
>   BUILD_BUG_ON(__builtin_constant_p(nr) &&\
>((nr) < 0 ||   \
> (nr) >= sizeof(type) * BITS_PER_BYTE));   \
>   rtn = ((type)1) << (nr);\
>   rtn;\
> })
> 
> #define BIT_U8(nr)BIT_OF_TYPE(u8, nr)
> #define BIT_U16(nr)   BIT_OF_TYPE(u16, nr)
> #define BIT_U32(nr)   BIT_OF_TYPE(u32, nr)
> #define BIT_U64(nr)   BIT_OF_TYPE(u64, nr)

As I said, I'd rather see less uses of BIT() instead of more. While
using 'BIT(23)' is often than the open-coded '1 << 23', I wish more
people would write that as '0x0080' instead. It's easier to
match with data sheets, and to compare to printk output, plus
it's non-ambiguous if you are dealing with data sheets that use
the IBM convention of counting the bits from the other end.

Arnd
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net-next 1/2] drivers: net: xgene: Add support RGMII TX/RX delay configuration

2015-10-16 Thread Florian Fainelli
On 16/10/15 13:35, Iyappan Subramanian wrote:
> Add RGMII TX/RX delay configuration support. RGMII standard requires 2ns
> delay to help the RGMII bridge receiver to sample data correctly. If the
> default value does not provide proper centering of the data sample, the
> TX/RX delay parameters can be used to adjust accordingly.

There is a standard 'phy-mode' property which can take multiple values
for RGMII: "rgmii" (no delay), "rgmii-txid" (transmit delay),
"rgmii-rxid" (receive delay) and "rgmii-id" (symetric delay). There does
not seem to be any verification of whether the rx or tx delay parameters
you introduce are going to be either sensible, or not conflicting with
the 'phy-mode' that should be configured.

> 
> Signed-off-by: Iyappan Subramanian 
> ---
>  drivers/net/ethernet/apm/xgene/xgene_enet_hw.c   |  8 +++-
>  drivers/net/ethernet/apm/xgene/xgene_enet_hw.h   |  1 +
>  drivers/net/ethernet/apm/xgene/xgene_enet_main.c | 49 
> 
>  drivers/net/ethernet/apm/xgene/xgene_enet_main.h |  2 +
>  4 files changed, 59 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c 
> b/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c
> index 652f218..33850a0 100644
> --- a/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c
> +++ b/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c
> @@ -461,6 +461,7 @@ static void xgene_gmac_reset(struct xgene_enet_pdata 
> *pdata)
>  
>  static void xgene_gmac_init(struct xgene_enet_pdata *pdata)
>  {
> + struct device *dev = &pdata->pdev->dev;
>   u32 value, mc2;
>   u32 intf_ctl, rgmii;
>   u32 icm0, icm2;
> @@ -490,7 +491,12 @@ static void xgene_gmac_init(struct xgene_enet_pdata 
> *pdata)
>   default:
>   ENET_INTERFACE_MODE2_SET(&mc2, 2);
>   intf_ctl |= ENET_GHD_MODE;
> - CFG_TXCLK_MUXSEL0_SET(&rgmii, 4);
> +
> + if (dev->of_node) {
> + CFG_TXCLK_MUXSEL0_SET(&rgmii, pdata->tx_delay);
> + CFG_RXCLK_MUXSEL0_SET(&rgmii, pdata->rx_delay);
> + }
> +
>   xgene_enet_rd_csr(pdata, DEBUG_REG_ADDR, &value);
>   value |= CFG_BYPASS_UNISEC_TX | CFG_BYPASS_UNISEC_RX;
>   xgene_enet_wr_csr(pdata, DEBUG_REG_ADDR, value);
> diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_hw.h 
> b/drivers/net/ethernet/apm/xgene/xgene_enet_hw.h
> index ff05bbc..6dee73c 100644
> --- a/drivers/net/ethernet/apm/xgene/xgene_enet_hw.h
> +++ b/drivers/net/ethernet/apm/xgene/xgene_enet_hw.h
> @@ -144,6 +144,7 @@ enum xgene_enet_rm {
>  #define CFG_BYPASS_UNISEC_RX BIT(1)
>  #define CFG_CLE_BYPASS_EN0   BIT(31)
>  #define CFG_TXCLK_MUXSEL0_SET(dst, val)  xgene_set_bits(dst, val, 29, 3)
> +#define CFG_RXCLK_MUXSEL0_SET(dst, val)  xgene_set_bits(dst, val, 26, 3)
>  
>  #define CFG_CLE_IP_PROTOCOL0_SET(dst, val)   xgene_set_bits(dst, val, 16, 2)
>  #define CFG_CLE_DSTQID0_SET(dst, val)xgene_set_bits(dst, 
> val, 0, 12)
> diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_main.c 
> b/drivers/net/ethernet/apm/xgene/xgene_enet_main.c
> index 6b1846d..ce10687 100644
> --- a/drivers/net/ethernet/apm/xgene/xgene_enet_main.c
> +++ b/drivers/net/ethernet/apm/xgene/xgene_enet_main.c
> @@ -1118,6 +1118,47 @@ static int xgene_get_port_id_dt(struct device *dev, 
> struct xgene_enet_pdata *pda
>   return ret;
>  }
>  
> +static int xgene_get_tx_delay(struct xgene_enet_pdata *pdata)
> +{
> + struct device *dev = &pdata->pdev->dev;
> + int delay, ret;
> +
> + ret = of_property_read_u32(dev->of_node, "tx-delay", &delay);
> + if (ret) {
> + pdata->tx_delay = 4;
> + return 0;
> + }
> +
> + if (delay < 0 || delay > 7) {
> + dev_err(dev, "Invalid tx-delay specified\n");
> + return -EINVAL;
> + }
> +
> + pdata->tx_delay = delay;
> +
> + return 0;
> +}
> +
> +static int xgene_get_rx_delay(struct xgene_enet_pdata *pdata)
> +{
> + struct device *dev = &pdata->pdev->dev;
> + int delay, ret;
> +
> + ret = of_property_read_u32(dev->of_node, "rx-delay", &delay);
> + if (ret) {
> + pdata->rx_delay = 2;
> + return 0;
> + }
> +
> + if (delay < 0 || delay > 7) {
> + dev_err(dev, "Invalid rx-delay specified\n");
> + return -EINVAL;
> + }
> +
> + pdata->rx_delay = delay;
> +
> + return 0;
> +}
>  
>  static int xgene_enet_get_resources(struct xgene_enet_pdata *pdata)
>  {
> @@ -1194,6 +1235,14 @@ static int xgene_enet_get_resources(struct 
> xgene_enet_pdata *pdata)
>   return -ENODEV;
>   }
>  
> + ret = xgene_get_tx_delay(pdata);
> + if (ret)
> + return ret;
> +
> + ret = xgene_get_rx_delay(pdata);
> + if (ret)
> + return ret;
> +
>   ret = platform_get_irq(pdev, 0);
>   if (ret <= 0) {
>   dev_err(dev, "Unable to get ENET Rx IRQ\n");
> diff --git a/d

Re: [PATCH] net: hix5hd2_gmac: avoid integer overload warning

2015-10-16 Thread Joe Perches
On Fri, 2015-10-16 at 21:50 +0300, Sergei Shtylyov wrote:
> On 10/16/2015 09:04 PM, Joe Perches wrote:
> 
>  BITS_RX_EN is an 'unsigned long' constant, so the ones complement of that
>  has bits set that do not fit into a 32-bit variable on 64-bit 
>  architectures,
>  which causes a harmless gcc warning:
> >>> ...
>    static void hix5hd2_port_disable(struct hix5hd2_priv *priv)
>    {
>  - writel_relaxed(~(BITS_RX_EN | BITS_TX_EN), priv->base + PORT_EN);
>  + writel_relaxed(~(u32)(BITS_RX_EN | BITS_TX_EN), priv->base + 
>  PORT_EN);
> writel_relaxed(0, priv->base + DESC_WR_RD_ENA);
> >>>
> >>> ISTM that just means that the constants shouldn't be 'long'.
> >>
> >> Right, but that would probably mean changing the BIT() macro or not using 
> >> it
> >> here. In the past I've argued against using that macro, but I've given
> >> up that fight.
> >
> > Fight on... (Somebody must have gone to USC here)
> >
> > There might be value in aefin BIT_U32 macro.
> > Maybe BIT_U64 too.
> 
> There's BIT_ULL() already.

I know, but symmetry is good.
I think there'd be no harm in adding it.
Perhaps adding all the sized variants would be useful.

Something like:

#define BIT_OF_TYPE(type, nr)   \
({  \
typeof(type) rtn;   \
BUILD_BUG_ON(__builtin_constant_p(nr) &&\
 ((nr) < 0 ||   \
  (nr) >= sizeof(type) * BITS_PER_BYTE));   \
rtn = ((type)1) << (nr);\
rtn;\
})

#define BIT_U8(nr)  BIT_OF_TYPE(u8, nr)
#define BIT_U16(nr) BIT_OF_TYPE(u16, nr)
#define BIT_U32(nr) BIT_OF_TYPE(u32, nr)
#define BIT_U64(nr) BIT_OF_TYPE(u64, nr)


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs

2015-10-16 Thread Alexei Starovoitov

On 10/16/15 12:53 PM, Eric W. Biederman wrote:

Alexei Starovoitov  writes:


On 10/16/15 11:41 AM, Eric W. Biederman wrote:

[...]

I am missing something.

When I suggested using a filesystem it was my thought there would be
exactly one superblock per map, and the map would be specified at mount
time.  You clearly are not implementing that.


I don't think it's practical to have sb per map, since that would mean
sb per prog and that won't scale.


What do you mean won't scale?  You want to have a name per map/prog so the
basic complexity appears the same.  Is there some crucial interaction
between the persistent dodads you are placing on a filesystem that I am
missing?

Given the fact you don't normally need any persistence without a program
I am puzzled why "scaling" is an issue of any kind.  This is for a
comparitively rare case if I am not mistaken.


representing map as a directory tree with files as keys is indeed 'rare'
since it's mainly for debugging and slow accesses,
but 'pin_fd' functionality now popping up everywhere.
Mainly because in things like openstack there are tons of disjoint
libraries written in different languages and the only thing
common is kernel. So pin_fd/new_fd is a mandatory feature.


Also map today is an fd that belongs to a process. I cannot see
an api from C program to do 'mount of FD' that wouldn't look like
ugly hack.


mount -t bpffs ... -o fd=1234

That is not all convoluted or hacky.  Especially compared to some of the
alternatives I am seeing.

It is no problem at all to wrap something like that in a nice function
call that has the exact same complexity of use as any of the other
options that are being explored to give something that starts out
as a filedescriptor a name.


Frankly, I don't think parsing 'fd=1234' string is a clean api, but
before we argue about fs philosophy of passing options, let's
get on the same page with requirements.
First goal that this patch is solving is providing an ability
to 'pin' an FD, so that map/prog won't disappear when user app exist.
Second goal of future patches is to expose map internals as a directory
structure.
These two goals are independent.
We can argue about api for 2nd, whether it's mount with fd=1234 string
or else, but for the first mount style doesn't make sense.


A filesystem per map makes sense as you have a key-value store with one
file per key.

The idea is that something resembling your bpf_pin_fd function would be
the mount system call for the filesystem.

The the keys in the map could be read by "ls /mountpoint/".
Key values could be inspected with "cat /mountpoint/key".


yes. that is still the goal for follow up patches, but contained
within given bpffs. Something bpf_pin_fd-like command for bpf syscall
would create files for keys in a map and allow 'cat' via open/read.
Such api would be much cleaner from C app point of view.
Potentially we can allow mount of a file created via BPF_PIN_FD
that will expand into keys/values.
All of that are our future plans.
There, actually, the main contention point is 'how to represent keys
and values'. whether key is hex representation or we need some
pretty-printers via format string or via schema? etc, etc.
We tried few ideas of representing keys in our fuse implementations,
but don't have an agreement yet.


My gut feel would be to keep it simple and use the same representation
you use in your existing system calls.  Certainly ordinary filenames are
keys of arbitrary binary data that can included everything except
a '\0' byte.  That they are human readable is a nice convention, but not
at all fundamental to what they are.


that doesn't work. map keys are never human readable. they're arbitrary
binary data. That's why representing them as file name is not trivial.
Some pretty-printer is needed.
Again that is 2nd goal of bpffs in general. We cannot really solve it
now, because we cannot say 'lets represent keys like X and work
from there', since that will become kernel ABI and we won't be able to
change that.
It's equally not clear that thousands of keys can even work as files.
So quite a bit of brainstorming still to do for this 2nd goal.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] net/phy: micrel: Follow attached_dev to get to the MAC device

2015-10-16 Thread Andrew Lunn
Commit "8b63ec1837fa phylib: Make PHYs children of their MDIO bus, not
the bus' parent." broke finding PHY properties in the MAC device tree
node. The parent device is now the MDIO bus, not the MAC. Use
attached_dev towards the MAC device tree node.

Fixes: 8b63ec1837fa ("phylib: Make PHYs children of their MDIO bus, not the 
bus' parent.")
Signed-off-by: Andrew Lunn 
---

Compile tested only.

Dinh, please could you test it and report back if it works or not.

 drivers/net/phy/micrel.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/phy/micrel.c b/drivers/net/phy/micrel.c
index 499185eaf413..9e2083a1a5a9 100644
--- a/drivers/net/phy/micrel.c
+++ b/drivers/net/phy/micrel.c
@@ -25,6 +25,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -340,8 +341,9 @@ static int ksz9021_config_init(struct phy_device *phydev)
const struct device *dev = &phydev->dev;
const struct device_node *of_node = dev->of_node;
 
-   if (!of_node && dev->parent->of_node)
-   of_node = dev->parent->of_node;
+   if (!of_node && phydev->attached_dev &&
+   phydev->attached_dev->dev.of_node)
+   of_node = phydev->attached_dev->dev.of_node;
 
if (of_node) {
ksz9021_load_values_from_of(phydev, of_node,
-- 
2.5.1

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH net-next 1/2] drivers: net: xgene: Add support RGMII TX/RX delay configuration

2015-10-16 Thread Iyappan Subramanian
Add RGMII TX/RX delay configuration support. RGMII standard requires 2ns
delay to help the RGMII bridge receiver to sample data correctly. If the
default value does not provide proper centering of the data sample, the
TX/RX delay parameters can be used to adjust accordingly.

Signed-off-by: Iyappan Subramanian 
---
 drivers/net/ethernet/apm/xgene/xgene_enet_hw.c   |  8 +++-
 drivers/net/ethernet/apm/xgene/xgene_enet_hw.h   |  1 +
 drivers/net/ethernet/apm/xgene/xgene_enet_main.c | 49 
 drivers/net/ethernet/apm/xgene/xgene_enet_main.h |  2 +
 4 files changed, 59 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c 
b/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c
index 652f218..33850a0 100644
--- a/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c
+++ b/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c
@@ -461,6 +461,7 @@ static void xgene_gmac_reset(struct xgene_enet_pdata *pdata)
 
 static void xgene_gmac_init(struct xgene_enet_pdata *pdata)
 {
+   struct device *dev = &pdata->pdev->dev;
u32 value, mc2;
u32 intf_ctl, rgmii;
u32 icm0, icm2;
@@ -490,7 +491,12 @@ static void xgene_gmac_init(struct xgene_enet_pdata *pdata)
default:
ENET_INTERFACE_MODE2_SET(&mc2, 2);
intf_ctl |= ENET_GHD_MODE;
-   CFG_TXCLK_MUXSEL0_SET(&rgmii, 4);
+
+   if (dev->of_node) {
+   CFG_TXCLK_MUXSEL0_SET(&rgmii, pdata->tx_delay);
+   CFG_RXCLK_MUXSEL0_SET(&rgmii, pdata->rx_delay);
+   }
+
xgene_enet_rd_csr(pdata, DEBUG_REG_ADDR, &value);
value |= CFG_BYPASS_UNISEC_TX | CFG_BYPASS_UNISEC_RX;
xgene_enet_wr_csr(pdata, DEBUG_REG_ADDR, value);
diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_hw.h 
b/drivers/net/ethernet/apm/xgene/xgene_enet_hw.h
index ff05bbc..6dee73c 100644
--- a/drivers/net/ethernet/apm/xgene/xgene_enet_hw.h
+++ b/drivers/net/ethernet/apm/xgene/xgene_enet_hw.h
@@ -144,6 +144,7 @@ enum xgene_enet_rm {
 #define CFG_BYPASS_UNISEC_RX   BIT(1)
 #define CFG_CLE_BYPASS_EN0 BIT(31)
 #define CFG_TXCLK_MUXSEL0_SET(dst, val)xgene_set_bits(dst, val, 29, 3)
+#define CFG_RXCLK_MUXSEL0_SET(dst, val)xgene_set_bits(dst, val, 26, 3)
 
 #define CFG_CLE_IP_PROTOCOL0_SET(dst, val) xgene_set_bits(dst, val, 16, 2)
 #define CFG_CLE_DSTQID0_SET(dst, val)  xgene_set_bits(dst, val, 0, 12)
diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_main.c 
b/drivers/net/ethernet/apm/xgene/xgene_enet_main.c
index 6b1846d..ce10687 100644
--- a/drivers/net/ethernet/apm/xgene/xgene_enet_main.c
+++ b/drivers/net/ethernet/apm/xgene/xgene_enet_main.c
@@ -1118,6 +1118,47 @@ static int xgene_get_port_id_dt(struct device *dev, 
struct xgene_enet_pdata *pda
return ret;
 }
 
+static int xgene_get_tx_delay(struct xgene_enet_pdata *pdata)
+{
+   struct device *dev = &pdata->pdev->dev;
+   int delay, ret;
+
+   ret = of_property_read_u32(dev->of_node, "tx-delay", &delay);
+   if (ret) {
+   pdata->tx_delay = 4;
+   return 0;
+   }
+
+   if (delay < 0 || delay > 7) {
+   dev_err(dev, "Invalid tx-delay specified\n");
+   return -EINVAL;
+   }
+
+   pdata->tx_delay = delay;
+
+   return 0;
+}
+
+static int xgene_get_rx_delay(struct xgene_enet_pdata *pdata)
+{
+   struct device *dev = &pdata->pdev->dev;
+   int delay, ret;
+
+   ret = of_property_read_u32(dev->of_node, "rx-delay", &delay);
+   if (ret) {
+   pdata->rx_delay = 2;
+   return 0;
+   }
+
+   if (delay < 0 || delay > 7) {
+   dev_err(dev, "Invalid rx-delay specified\n");
+   return -EINVAL;
+   }
+
+   pdata->rx_delay = delay;
+
+   return 0;
+}
 
 static int xgene_enet_get_resources(struct xgene_enet_pdata *pdata)
 {
@@ -1194,6 +1235,14 @@ static int xgene_enet_get_resources(struct 
xgene_enet_pdata *pdata)
return -ENODEV;
}
 
+   ret = xgene_get_tx_delay(pdata);
+   if (ret)
+   return ret;
+
+   ret = xgene_get_rx_delay(pdata);
+   if (ret)
+   return ret;
+
ret = platform_get_irq(pdev, 0);
if (ret <= 0) {
dev_err(dev, "Unable to get ENET Rx IRQ\n");
diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_main.h 
b/drivers/net/ethernet/apm/xgene/xgene_enet_main.h
index ff89a5d..a6e56b8 100644
--- a/drivers/net/ethernet/apm/xgene/xgene_enet_main.h
+++ b/drivers/net/ethernet/apm/xgene/xgene_enet_main.h
@@ -184,6 +184,8 @@ struct xgene_enet_pdata {
u8 bp_bufnum;
u16 ring_num;
u32 mss;
+   u8 tx_delay;
+   u8 rx_delay;
 };
 
 struct xgene_indirect_ctl {
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://v

[PATCH net-next 0/2] drivers: xgene: Add support RGMII TX/RX delay configuration

2015-10-16 Thread Iyappan Subramanian
This patch adds support RGMII TX/RX delay configuration.

Signed-off-by: Iyappan Subramanian 
---

Iyappan Subramanian (2):
  drivers: net: xgene: Add support RGMII TX/RX delay configuration
  Documentation: dts: xgene: Add TX/RX delay field

 .../devicetree/bindings/net/apm-xgene-enet.txt | 10 +
 drivers/net/ethernet/apm/xgene/xgene_enet_hw.c |  8 +++-
 drivers/net/ethernet/apm/xgene/xgene_enet_hw.h |  1 +
 drivers/net/ethernet/apm/xgene/xgene_enet_main.c   | 49 ++
 drivers/net/ethernet/apm/xgene/xgene_enet_main.h   |  2 +
 5 files changed, 69 insertions(+), 1 deletion(-)

-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH net-next 2/2] Documentation: dts: xgene: Add TX/RX delay field

2015-10-16 Thread Iyappan Subramanian
Signed-off-by: Iyappan Subramanian 
---
 Documentation/devicetree/bindings/net/apm-xgene-enet.txt | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/Documentation/devicetree/bindings/net/apm-xgene-enet.txt 
b/Documentation/devicetree/bindings/net/apm-xgene-enet.txt
index f55aa28..5b17c13 100644
--- a/Documentation/devicetree/bindings/net/apm-xgene-enet.txt
+++ b/Documentation/devicetree/bindings/net/apm-xgene-enet.txt
@@ -37,6 +37,14 @@ Required properties for ethernet interfaces that have 
external PHY:
 
 Optional properties:
 - status: Should be "ok" or "disabled" for enabled/disabled. Default is "ok".
+- tx-delay: Delay value for RGMII TXD timming.
+   Valid values are between 0 to 7, that maps to
+   417, 717, 1020, 1321, 1611, 1913, 2215, 2514 ps
+   Default value is 4, which corresponds to 1611 ps
+- rx-delay: Delay value for RGMII RXD timming in ps.
+   Valid values are between 0 to 7, that maps to
+   273, 589, 899, 1222, 1480, 1806, 2147, 2464 ps
+   Default value is 2, which corresponds to 899 ps
 
 Example:
menetclk: menetclk {
@@ -72,5 +80,7 @@ Example:
 
 /* Board-specific peripheral configurations */
 &menet {
+   tx-delay = <4>;
+   rx-delay = <2>;
 status = "ok";
 };
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: SoCFPGA ethernet broken

2015-10-16 Thread Andrew Lunn
On Fri, Oct 16, 2015 at 01:24:16PM -0700, David Daney wrote:
> On 10/16/2015 12:38 PM, Andrew Lunn wrote:
> >>>Maybe we need to walk up the hierarchy.
> >>>
> >>>Perhaps something like:
> >>>
> >>>const struct device *dev_walker;
> >>>
> >>>dev_walker = &phydev->dev;
> >>>do {
> >>>of_node = dev_walker->of_node;
> >>>dev_walker = dev_walker->parent;
> >>>} while (!of_node && dev_walker);
> >>>
> >>
> >>The above code seems to have fixed the issue.
> >
> >What i don't like about this is that it allows you to put these
> >properties in the mdio device node. These are phy properties, not mdio
> >properties
> 
> Yes, we know that the device tree is not correctly formed.
> 
> You have a choice:
> 
> A) Fix the device tree and any code that may have to change to work
> with a good device tree.
> 
> B) Change the code to work with the screwy existing device tree.
> The above seems to work, other things may also be possible.
> 
> I can't really make any decisions as to what the best way to proceed
> is, as I neither have the hardware in question, nor the time to work
> on it.

Hi hope Dinh will test with phydev->attached_dev->dev->of_node.

   Andrew
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: SoCFPGA ethernet broken

2015-10-16 Thread David Daney

On 10/16/2015 12:38 PM, Andrew Lunn wrote:

Maybe we need to walk up the hierarchy.

Perhaps something like:

const struct device *dev_walker;

dev_walker = &phydev->dev;
do {
of_node = dev_walker->of_node;
dev_walker = dev_walker->parent;
} while (!of_node && dev_walker);



The above code seems to have fixed the issue.


What i don't like about this is that it allows you to put these
properties in the mdio device node. These are phy properties, not mdio
properties


Yes, we know that the device tree is not correctly formed.

You have a choice:

A) Fix the device tree and any code that may have to change to work with 
a good device tree.


B) Change the code to work with the screwy existing device tree.  The 
above seems to work, other things may also be possible.


I can't really make any decisions as to what the best way to proceed is, 
as I neither have the hardware in question, nor the time to work on it.


David Daney



If phydev->attached_dev->dev->of_node works, that would be my
preference.

Andrew



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs

2015-10-16 Thread Eric W. Biederman
Alexei Starovoitov  writes:

> On 10/16/15 11:41 AM, Eric W. Biederman wrote:
[...]
>> I am missing something.
>>
>> When I suggested using a filesystem it was my thought there would be
>> exactly one superblock per map, and the map would be specified at mount
>> time.  You clearly are not implementing that.
>
> I don't think it's practical to have sb per map, since that would mean
> sb per prog and that won't scale.

What do you mean won't scale?  You want to have a name per map/prog so the
basic complexity appears the same.  Is there some crucial interaction
between the persistent dodads you are placing on a filesystem that I am
missing?

Given the fact you don't normally need any persistence without a program
I am puzzled why "scaling" is an issue of any kind.  This is for a
comparitively rare case if I am not mistaken.

> Also map today is an fd that belongs to a process. I cannot see
> an api from C program to do 'mount of FD' that wouldn't look like
> ugly hack.

mount -t bpffs ... -o fd=1234 

That is not all convoluted or hacky.  Especially compared to some of the
alternatives I am seeing.

It is no problem at all to wrap something like that in a nice function
call that has the exact same complexity of use as any of the other
options that are being explored to give something that starts out
as a filedescriptor a name.

>> A filesystem per map makes sense as you have a key-value store with one
>> file per key.
>>
>> The idea is that something resembling your bpf_pin_fd function would be
>> the mount system call for the filesystem.
>>
>> The the keys in the map could be read by "ls /mountpoint/".
>> Key values could be inspected with "cat /mountpoint/key".
>
> yes. that is still the goal for follow up patches, but contained
> within given bpffs. Something bpf_pin_fd-like command for bpf syscall
> would create files for keys in a map and allow 'cat' via open/read.
> Such api would be much cleaner from C app point of view.
> Potentially we can allow mount of a file created via BPF_PIN_FD
> that will expand into keys/values.
> All of that are our future plans.
> There, actually, the main contention point is 'how to represent keys
> and values'. whether key is hex representation or we need some
> pretty-printers via format string or via schema? etc, etc.
> We tried few ideas of representing keys in our fuse implementations,
> but don't have an agreement yet.

My gut feel would be to keep it simple and use the same representation
you use in your existing system calls.  Certainly ordinary filenames are
keys of arbitrary binary data that can included everything except
a '\0' byte.  That they are human readable is a nice convention, but not
at all fundamental to what they are.

Eric

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH net-next] tcp: do not set queue_mapping on SYNACK

2015-10-16 Thread Eric Dumazet
From: Eric Dumazet 

At the time of commit fff326990789 ("tcp: reflect SYN queue_mapping into
SYNACK packets") we had little ways to cope with SYN floods.

We no longer need to reflect incoming skb queue mappings, and instead
can pick a TX queue based on cpu cooking the SYNACK, with normal XPS
affinities.

Note that all SYNACK retransmits were picking TX queue 0, this no longer
is a win given that SYNACK rtx are now distributed on all cpus.

Signed-off-by: Eric Dumazet 
---
 include/net/tcp.h |2 +-
 net/ipv4/ip_output.c  |1 -
 net/ipv4/tcp_input.c  |4 ++--
 net/ipv4/tcp_ipv4.c   |2 --
 net/ipv4/tcp_output.c |2 +-
 net/ipv6/tcp_ipv6.c   |2 --
 6 files changed, 4 insertions(+), 9 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index a6be56d5f0e3..eed94fc355c1 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1716,7 +1716,7 @@ struct tcp_request_sock_ops {
__u32 (*init_seq)(const struct sk_buff *skb);
int (*send_synack)(const struct sock *sk, struct dst_entry *dst,
   struct flowi *fl, struct request_sock *req,
-  u16 queue_mapping, struct tcp_fastopen_cookie *foc,
+  struct tcp_fastopen_cookie *foc,
   bool attach_req);
 };
 
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 67404e1fe7d4..50e29737b584 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1596,7 +1596,6 @@ void ip_send_unicast_reply(struct sock *sk, struct 
sk_buff *skb,
  arg->csumoffset) = csum_fold(csum_add(nskb->csum,
arg->csum));
nskb->ip_summed = CHECKSUM_NONE;
-   skb_set_queue_mapping(nskb, skb_get_queue_mapping(skb));
ip_push_pending_frames(sk, &fl4);
}
 out:
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 3b35c3f4d268..944eaca69115 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -6236,7 +6236,7 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
}
if (fastopen_sk) {
af_ops->send_synack(fastopen_sk, dst, &fl, req,
-   skb_get_queue_mapping(skb), &foc, false);
+   &foc, false);
/* Add the child socket directly into the accept queue */
inet_csk_reqsk_queue_add(sk, req, fastopen_sk);
sk->sk_data_ready(sk);
@@ -6247,7 +6247,7 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
if (!want_cookie)
inet_csk_reqsk_queue_hash_add(sk, req, 
TCP_TIMEOUT_INIT);
af_ops->send_synack(sk, dst, &fl, req,
-   skb_get_queue_mapping(skb), &foc, 
!want_cookie);
+   &foc, !want_cookie);
if (want_cookie)
goto drop_and_free;
}
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 9c68cf3762c4..30dd45c1f568 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -821,7 +821,6 @@ static void tcp_v4_reqsk_send_ack(const struct sock *sk, 
struct sk_buff *skb,
 static int tcp_v4_send_synack(const struct sock *sk, struct dst_entry *dst,
  struct flowi *fl,
  struct request_sock *req,
- u16 queue_mapping,
  struct tcp_fastopen_cookie *foc,
  bool attach_req)
 {
@@ -839,7 +838,6 @@ static int tcp_v4_send_synack(const struct sock *sk, struct 
dst_entry *dst,
if (skb) {
__tcp_v4_send_check(skb, ireq->ir_loc_addr, ireq->ir_rmt_addr);
 
-   skb_set_queue_mapping(skb, queue_mapping);
err = ip_build_and_send_pkt(skb, sk, ireq->ir_loc_addr,
ireq->ir_rmt_addr,
ireq->opt);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 6e79fcb0addb..19adedb8c5cc 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3518,7 +3518,7 @@ int tcp_rtx_synack(const struct sock *sk, struct 
request_sock *req)
int res;
 
tcp_rsk(req)->txhash = net_tx_rndhash();
-   res = af_ops->send_synack(sk, NULL, &fl, req, 0, NULL, true);
+   res = af_ops->send_synack(sk, NULL, &fl, req, NULL, true);
if (!res) {
TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_RETRANSSEGS);
NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPSYNRETRANS);
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index acb06f86f372..f495d189f5e0 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -437,7 +437,6 @@ out:
 static int tcp_v6_send_synack(const struct sock *sk, struct dst_entry *dst,
  struct flowi *fl,
  struct request_so

Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs

2015-10-16 Thread Daniel Borkmann

On 10/16/2015 09:27 PM, Alexei Starovoitov wrote:

On 10/16/15 11:41 AM, Eric W. Biederman wrote:

Daniel Borkmann  writes:

On 10/16/2015 07:42 PM, Alexei Starovoitov wrote:

On 10/16/15 10:21 AM, Hannes Frederic Sowa wrote:

Another question:
Should multiple mount of the filesystem result in an empty fs (a new
instance) or in one were one can see other ebpf-fs entities? I think
Daniel wanted to already use the mountpoint as some kind of hierarchy
delimiter. I would have used directories for that and multiple mounts
would then have resulted in the same content of the filesystem. IMHO
this would remove some ambiguity but then the question arises how this
is handled in a namespaced environment. Was there some specific reason
to do so?


That's an interesting question!
I think all mounts should be independent.
I can see tracing using one and networking using another one
with different hierarchies suitable for their own use cases.
What's an advantage to have the same content everywhere?
Feels harder to manage, since different users would need to
coordinate.


I initially had it as a mount_single() file system, where I was thinking
to have an entry under /sys/fs/bpf/, so all subsystems would work on top
of that mount point, but for the same reasons above I lifted that restriction.


I am missing something.

When I suggested using a filesystem it was my thought there would be
exactly one superblock per map, and the map would be specified at mount
time.  You clearly are not implementing that.


I don't think it's practical to have sb per map, since that would mean
sb per prog and that won't scale.
Also map today is an fd that belongs to a process. I cannot see
an api from C program to do 'mount of FD' that wouldn't look like
ugly hack.


A filesystem per map makes sense as you have a key-value store with one
file per key.

The idea is that something resembling your bpf_pin_fd function would be
the mount system call for the filesystem.

The the keys in the map could be read by "ls /mountpoint/".
Key values could be inspected with "cat /mountpoint/key".


yes. that is still the goal for follow up patches, but contained
within given bpffs. Something bpf_pin_fd-like command for bpf syscall
would create files for keys in a map and allow 'cat' via open/read.
Such api would be much cleaner from C app point of view.
Potentially we can allow mount of a file created via BPF_PIN_FD
that will expand into keys/values.


Yeah, sort of making this an optional debugging facility if anything (maybe
to just get a read-only snapshot view). Having maps with a very large number
of entries might end up being problematic by its own, or mapping potential
future map candidates such as rhashtable.


There, actually, the main contention point is 'how to represent keys
and values'. whether key is hex representation or we need some
pretty-printers via format string or via schema? etc, etc.
We tried few ideas of representing keys in our fuse implementations,
but don't have an agreement yet.


That is unclear as well to make it useful.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: SoCFPGA ethernet broken

2015-10-16 Thread Andrew Lunn
> > Maybe we need to walk up the hierarchy.
> > 
> > Perhaps something like:
> > 
> > const struct device *dev_walker;
> > 
> > dev_walker = &phydev->dev;
> > do {
> >of_node = dev_walker->of_node;
> >dev_walker = dev_walker->parent;
> > } while (!of_node && dev_walker);
> > 
> 
> The above code seems to have fixed the issue.

What i don't like about this is that it allows you to put these
properties in the mdio device node. These are phy properties, not mdio
properties

If phydev->attached_dev->dev->of_node works, that would be my
preference.

   Andrew
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs

2015-10-16 Thread Alexei Starovoitov

On 10/16/15 11:41 AM, Eric W. Biederman wrote:

Daniel Borkmann  writes:


On 10/16/2015 07:42 PM, Alexei Starovoitov wrote:

On 10/16/15 10:21 AM, Hannes Frederic Sowa wrote:

Another question:
Should multiple mount of the filesystem result in an empty fs (a new
instance) or in one were one can see other ebpf-fs entities? I think
Daniel wanted to already use the mountpoint as some kind of hierarchy
delimiter. I would have used directories for that and multiple mounts
would then have resulted in the same content of the filesystem. IMHO
this would remove some ambiguity but then the question arises how this
is handled in a namespaced environment. Was there some specific reason
to do so?


That's an interesting question!
I think all mounts should be independent.
I can see tracing using one and networking using another one
with different hierarchies suitable for their own use cases.
What's an advantage to have the same content everywhere?
Feels harder to manage, since different users would need to
coordinate.


I initially had it as a mount_single() file system, where I was thinking
to have an entry under /sys/fs/bpf/, so all subsystems would work on top
of that mount point, but for the same reasons above I lifted that restriction.


I am missing something.

When I suggested using a filesystem it was my thought there would be
exactly one superblock per map, and the map would be specified at mount
time.  You clearly are not implementing that.


I don't think it's practical to have sb per map, since that would mean
sb per prog and that won't scale.
Also map today is an fd that belongs to a process. I cannot see
an api from C program to do 'mount of FD' that wouldn't look like
ugly hack.


A filesystem per map makes sense as you have a key-value store with one
file per key.

The idea is that something resembling your bpf_pin_fd function would be
the mount system call for the filesystem.

The the keys in the map could be read by "ls /mountpoint/".
Key values could be inspected with "cat /mountpoint/key".


yes. that is still the goal for follow up patches, but contained
within given bpffs. Something bpf_pin_fd-like command for bpf syscall
would create files for keys in a map and allow 'cat' via open/read.
Such api would be much cleaner from C app point of view.
Potentially we can allow mount of a file created via BPF_PIN_FD
that will expand into keys/values.
All of that are our future plans.
There, actually, the main contention point is 'how to represent keys
and values'. whether key is hex representation or we need some
pretty-printers via format string or via schema? etc, etc.
We tried few ideas of representing keys in our fuse implementations,
but don't have an agreement yet.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net v2] openvswitch: Allocate memory for ovs internal device stats.

2015-10-16 Thread Pravin Shelar
On Fri, Oct 16, 2015 at 4:07 AM, James Morse  wrote:
> "openvswitch: Remove vport stats" removed the per-vport statistics, in
> order to use the netdev's statistics fields.
> "openvswitch: Fix ovs_vport_get_stats()" fixed the export of these stats
> to user-space, by using the provided netdev_ops to collate them - but ovs
> internal devices still use an unallocated dev->tstats field to count
> packets, which are no longer exported by this api.
>
> Allocate the dev->tstats field for ovs internal devices, and wire up
> ndo_get_stats64 with the original implementation of
> ovs_vport_get_stats().
>
> On its own, "openvswitch: Fix ovs_vport_get_stats()" fixes the OOPs,
> unmasking a full-on panic on arm64:
> =%<==
> [] internal_dev_recv+0xa8/0x170 [openvswitch]
> [] do_output.isra.31+0x60/0x19c [openvswitch]
> [] do_execute_actions+0x208/0x11c0 [openvswitch]
> [] ovs_execute_actions+0xc8/0x238 [openvswitch]
> [] ovs_packet_cmd_execute+0x21c/0x288 [openvswitch]
> [] genl_family_rcv_msg+0x1b0/0x310
> [] genl_rcv_msg+0xa4/0xe4
> [] netlink_rcv_skb+0xb0/0xdc
> [] genl_rcv+0x38/0x50
> [] netlink_unicast+0x164/0x210
> [] netlink_sendmsg+0x304/0x368
> [] sock_sendmsg+0x30/0x4c
> [SNIP]
> Kernel panic - not syncing: Fatal exception in interrupt
> =%<==
>
> Fixes: 8c876639c985 ("openvswitch: Remove vport stats.")
> Signed-off-by: James Morse 
>
> ---
> Hi netdev,
>
> "openvswitch: Fix ovs_vport_get_stats()" [0], is an incomplete fix for the
> issue in "openvswitch: Remove vport stats.". Use of an unallocated
> dev->tstats remains for ovs internal devices, which causes a panic on arm64.
>
> Change since v1:
> * Removed superfluous rx/tx_dropped increment, already done by 
> dev_get_stats().
>
> Thanks!
>
> James Morse
>
> [0] https://patchwork.ozlabs.org/patch/525827/
>
>  net/openvswitch/vport-internal_dev.c | 40 
> +++-
>  1 file changed, 39 insertions(+), 1 deletion(-)
>
> diff --git a/net/openvswitch/vport-internal_dev.c 
> b/net/openvswitch/vport-internal_dev.c
> index 388b8a6bf112..957fe715b544 100644
> --- a/net/openvswitch/vport-internal_dev.c
> +++ b/net/openvswitch/vport-internal_dev.c
> @@ -106,12 +106,45 @@ static void internal_dev_destructor(struct net_device 
> *dev)
> free_netdev(dev);
>  }
>
...

>  static const struct net_device_ops internal_dev_netdev_ops = {
> .ndo_open = internal_dev_open,
> .ndo_stop = internal_dev_stop,
> .ndo_start_xmit = internal_dev_xmit,
> .ndo_set_mac_address = eth_mac_addr,
> .ndo_change_mtu = internal_dev_change_mtu,
> +   .ndo_get_stats64 = internal_get_stats,
>  };
>
>  static struct rtnl_link_ops internal_dev_link_ops __read_mostly = {
> @@ -161,6 +194,11 @@ static struct vport *internal_dev_create(const struct 
> vport_parms *parms)
> err = -ENOMEM;
> goto error_free_vport;
> }
> +   vport->dev->tstats = netdev_alloc_pcpu_stats(struct pcpu_sw_netstats);
> +   if (!vport->dev->tstats) {
> +   err = -ENOMEM;
> +   goto error_free_vport;
> +   }
>
Exception handling code is not correct, needs to free netdev here.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: SoCFPGA ethernet broken

2015-10-16 Thread Dinh Nguyen
On Fri, 16 Oct 2015, David Daney wrote:

> On 10/16/2015 08:56 AM, Andrew Lunn wrote:
> > > So I think I'll move to inspect what Florian had suggested, and that was
> > > to look
> > > at:
> > > drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.c::stmmac_mdio_register
> > 
> > I have a suspicion. If you look at the phy driver it does:
> > 
> > static int ksz9021_config_init(struct phy_device *phydev)
> > {
> >  const struct device *dev = &phydev->dev;
> >  const struct device_node *of_node = dev->of_node;
> > 
> >  if (!of_node && dev->parent->of_node)
> >  of_node = dev->parent->of_node;
> > 
> 
> Maybe we need to walk up the hierarchy.
> 
> Perhaps something like:
> 
> const struct device *dev_walker;
> 
> dev_walker = &phydev->dev;
> do {
>of_node = dev_walker->of_node;
>dev_walker = dev_walker->parent;
> } while (!of_node && dev_walker);
> 

The above code seems to have fixed the issue.

> An alternative would be to assign the bus the same of_node as the bus parent.
> 
> If either approach works, you can add:
>  Acked-by: David Daney 
> 
> to the patch that implements it.
> 
> > 
> > In your case, you don't have a phy node in your device tree, so of_node
> > is NULL. So it looks in the parent device.
> > 
> >  phylib: Make PHYs children of their MDIO bus, not the bus' parent.
> > 
> > changed what the parent is. It is now the mdio device. Before, i
> > suspect it was the MAC. Hence it found your properties in the MAC
> > node.
> > 
> > What i think you might want to do is change this code. Rather than
> > look a dev->parent->of_node; you might want
> > phydev->attached_dev->dev->of_node.
> > 
> > This assumes the phy has been attached to the MAC. I've no idea of the
> > ordering, so maybe it has not been attached yet?
> > 
> > dp83867.c has similar code. However quick grep did not find any
> > mainline users with properties in the MAC node. If that is true, i
> > would suggest removing the code looking in the parent for that phy
> > driver.
> > 
> > Andrew
> > 
> 
> 

BR,
Dinh
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] net: hisilicon: add OF dependency

2015-10-16 Thread Arnd Bergmann
On Saturday 17 October 2015 02:39:51 kbuild test robot wrote:
> 
> [auto build test WARNING on net-next/master -- if it's inappropriate base, 
> please suggest rules for selecting the more suitable base]
> 
> url:
> https://github.com/0day-ci/linux/commits/Arnd-Bergmann/net-hisilicon-add-OF-dependency/20151016-173818
> config: ia64-allmodconfig (attached as .config)
> reproduce:
> wget 
> https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross
>  -O ~/bin/make.cross
> chmod +x ~/bin/make.cross
> # save the attached .config to linux build tree
> make.cross ARCH=ia64 
> 

I guess I should have sent my three patches as a series. The build bot
found the same two bugs I had found myself and sent fixes for, but
applying the 'add OF dependency' without the other two makes them appear
in more configurations.

For all I can see, the two reports here are fixed if all three of my
patches are applied.

Arnd
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHv2 net 3/3] openvswitch: Serialize nested ct actions if provided

2015-10-16 Thread Pravin Shelar
On Fri, Oct 16, 2015 at 11:08 AM, Joe Stringer  wrote:
> If userspace provides a ct action with no nested mark or label, then the
> storage for these fields is zeroed. Later when actions are requested,
> such zeroed fields are serialized even though userspace didn't
> originally specify them. Fix the behaviour by ensuring that no action is
> serialized in this case, and reject actions where userspace attempts to
> set these fields with mask=0. This should make netlink marshalling
> consistent across deserialization/reserialization.
>
> Reported-by: Jarno Rajahalme 
> Signed-off-by: Joe Stringer 

Acked-by: Pravin B Shelar 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs

2015-10-16 Thread Eric W. Biederman
Daniel Borkmann  writes:

> On 10/16/2015 07:42 PM, Alexei Starovoitov wrote:
>> On 10/16/15 10:21 AM, Hannes Frederic Sowa wrote:
>>> Another question:
>>> Should multiple mount of the filesystem result in an empty fs (a new
>>> instance) or in one were one can see other ebpf-fs entities? I think
>>> Daniel wanted to already use the mountpoint as some kind of hierarchy
>>> delimiter. I would have used directories for that and multiple mounts
>>> would then have resulted in the same content of the filesystem. IMHO
>>> this would remove some ambiguity but then the question arises how this
>>> is handled in a namespaced environment. Was there some specific reason
>>> to do so?
>>
>> That's an interesting question!
>> I think all mounts should be independent.
>> I can see tracing using one and networking using another one
>> with different hierarchies suitable for their own use cases.
>> What's an advantage to have the same content everywhere?
>> Feels harder to manage, since different users would need to
>> coordinate.
>
> I initially had it as a mount_single() file system, where I was thinking
> to have an entry under /sys/fs/bpf/, so all subsystems would work on top
> of that mount point, but for the same reasons above I lifted that restriction.

I am missing something.

When I suggested using a filesystem it was my thought there would be
exactly one superblock per map, and the map would be specified at mount
time.  You clearly are not implementing that.

A filesystem per map makes sense as you have a key-value store with one
file per key.

The idea is that something resembling your bpf_pin_fd function would be
the mount system call for the filesystem.

The the keys in the map could be read by "ls /mountpoint/".
Key values could be inspected with "cat /mountpoint/key".

That allows all hierarchy etc to be handled in userspace, just as with
my files for namespaces.

I do not understand why you have presented to userspace a magic
filesystem that you allow binding to.  That is not what I intended to
suggest and I do not know how that makes any sense.

Eric


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] net: hix5hd2_gmac: avoid integer overload warning

2015-10-16 Thread Sergei Shtylyov

On 10/16/2015 09:04 PM, Joe Perches wrote:


BITS_RX_EN is an 'unsigned long' constant, so the ones complement of that
has bits set that do not fit into a 32-bit variable on 64-bit architectures,
which causes a harmless gcc warning:

...

  static void hix5hd2_port_disable(struct hix5hd2_priv *priv)
  {
- writel_relaxed(~(BITS_RX_EN | BITS_TX_EN), priv->base + PORT_EN);
+ writel_relaxed(~(u32)(BITS_RX_EN | BITS_TX_EN), priv->base + PORT_EN);
   writel_relaxed(0, priv->base + DESC_WR_RD_ENA);


ISTM that just means that the constants shouldn't be 'long'.


Right, but that would probably mean changing the BIT() macro or not using it
here. In the past I've argued against using that macro, but I've given
up that fight.


Fight on... (Somebody must have gone to USC here)

There might be value in a BIT_U32 macro.
Maybe BIT_U64 too.


   There's BIT_ULL() already.

MBR, Sergei

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHv2 net] openvswitch: Scrub skb between namespaces

2015-10-16 Thread Pravin Shelar
On Fri, Oct 16, 2015 at 11:08 AM, Joe Stringer  wrote:
> If OVS receives a packet from another namespace, then the packet should
> be scrubbed. However, people have already begun to rely on the behaviour
> that skb->mark is preserved across namespaces, so retain this one field.
>
> This is mainly to address information leakage between namespaces when
> using OVS internal ports, but by placing it in ovs_vport_receive() it is
> more generally applicable, meaning it should not be overlooked if other
> port types are allowed to be moved into namespaces in future.
>
> Signed-off-by: Joe Stringer 

Acked-by: Pravin B Shelar 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] net: hisilicon: add OF dependency

2015-10-16 Thread kbuild test robot
Hi Arnd,

[auto build test WARNING on net-next/master -- if it's inappropriate base, 
please suggest rules for selecting the more suitable base]

url:
https://github.com/0day-ci/linux/commits/Arnd-Bergmann/net-hisilicon-add-OF-dependency/20151016-173818
config: ia64-allmodconfig (attached as .config)
reproduce:
wget 
https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross
 -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=ia64 

All warnings (new ones prefixed by >>):

   In file included from arch/ia64/include/asm/smp.h:20:0,
from include/linux/smp.h:59,
from include/linux/topology.h:33,
from include/linux/gfp.h:8,
from include/linux/kmod.h:22,
from include/linux/module.h:13,
from drivers/net/ethernet/hisilicon/hix5hd2_gmac.c:10:
   drivers/net/ethernet/hisilicon/hix5hd2_gmac.c: In function 
'hix5hd2_port_disable':
   arch/ia64/include/asm/io.h:398:38: warning: large integer implicitly 
truncated to unsigned type [-Woverflow]
#define writel_relaxed(v,a) __writel((v), (a))
 ^
>> drivers/net/ethernet/hisilicon/hix5hd2_gmac.c:374:2: note: in expansion of 
>> macro 'writel_relaxed'
 writel_relaxed(~(BITS_RX_EN | BITS_TX_EN), priv->base + PORT_EN);
 ^

vim +/writel_relaxed +374 drivers/net/ethernet/hisilicon/hix5hd2_gmac.c

57c5bc9a Zhangfei Gao 2014-06-03  358   writel_relaxed(DEF_INT_MASK, priv->base 
+ ENA_PMU_INT);
57c5bc9a Zhangfei Gao 2014-06-03  359  }
57c5bc9a Zhangfei Gao 2014-06-03  360  
57c5bc9a Zhangfei Gao 2014-06-03  361  static void hix5hd2_irq_disable(struct 
hix5hd2_priv *priv)
57c5bc9a Zhangfei Gao 2014-06-03  362  {
57c5bc9a Zhangfei Gao 2014-06-03  363   writel_relaxed(0, priv->base + 
ENA_PMU_INT);
57c5bc9a Zhangfei Gao 2014-06-03  364  }
57c5bc9a Zhangfei Gao 2014-06-03  365  
57c5bc9a Zhangfei Gao 2014-06-03  366  static void hix5hd2_port_enable(struct 
hix5hd2_priv *priv)
57c5bc9a Zhangfei Gao 2014-06-03  367  {
57c5bc9a Zhangfei Gao 2014-06-03  368   writel_relaxed(0xf, priv->base + 
DESC_WR_RD_ENA);
57c5bc9a Zhangfei Gao 2014-06-03  369   writel_relaxed(BITS_RX_EN | BITS_TX_EN, 
priv->base + PORT_EN);
57c5bc9a Zhangfei Gao 2014-06-03  370  }
57c5bc9a Zhangfei Gao 2014-06-03  371  
57c5bc9a Zhangfei Gao 2014-06-03  372  static void hix5hd2_port_disable(struct 
hix5hd2_priv *priv)
57c5bc9a Zhangfei Gao 2014-06-03  373  {
57c5bc9a Zhangfei Gao 2014-06-03 @374   writel_relaxed(~(BITS_RX_EN | 
BITS_TX_EN), priv->base + PORT_EN);
57c5bc9a Zhangfei Gao 2014-06-03  375   writel_relaxed(0, priv->base + 
DESC_WR_RD_ENA);
57c5bc9a Zhangfei Gao 2014-06-03  376  }
57c5bc9a Zhangfei Gao 2014-06-03  377  
57c5bc9a Zhangfei Gao 2014-06-03  378  static void 
hix5hd2_hw_set_mac_addr(struct net_device *dev)
57c5bc9a Zhangfei Gao 2014-06-03  379  {
57c5bc9a Zhangfei Gao 2014-06-03  380   struct hix5hd2_priv *priv = 
netdev_priv(dev);
57c5bc9a Zhangfei Gao 2014-06-03  381   unsigned char *mac = dev->dev_addr;
57c5bc9a Zhangfei Gao 2014-06-03  382   u32 val;

:: The code at line 374 was first introduced by commit
:: 57c5bc9ad7d799e9507ba6e993398d2c55f03fab net: hisilicon: add hix5hd2 mac 
driver

:: TO: Zhangfei Gao 
:: CC: David S. Miller 

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: Binary data


Re: [PATCH] net: hisilicon: add OF dependency

2015-10-16 Thread Joe Perches
On Sat, 2015-10-17 at 02:16 +0800, kbuild test robot wrote:
> All errors (new ones prefixed by >>):
>drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c: In function 
> 'hns_dsaf_get_cfg':
> >> drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c:151:3: error: implicit 
> >> declaration of function 'iounmap' [-Werror=implicit-function-declaration]
>   iounmap(dsaf_dev->io_base);
>   ^
>cc1: some warnings being treated as errors
> 
> vim +/iounmap +151 drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c
[]
> 511e6bc0 huangdaode 2015-09-17  142   if 
> (!dma_set_mask_and_coherent(dsaf_dev->dev, DMA_BIT_MASK(64ULL)))

btw: this should be DMA_BIT_MASK(64) without the ULL



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] net: hisilicon: add OF dependency

2015-10-16 Thread kbuild test robot
Hi Arnd,

[auto build test ERROR on net-next/master -- if it's inappropriate base, please 
suggest rules for selecting the more suitable base]

url:
https://github.com/0day-ci/linux/commits/Arnd-Bergmann/net-hisilicon-add-OF-dependency/20151016-173818
config: um-allyesconfig (attached as .config)
reproduce:
# save the attached .config to linux build tree
make ARCH=um 

All errors (new ones prefixed by >>):

   drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c: In function 
'hns_dsaf_get_cfg':
>> drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c:151:3: error: implicit 
>> declaration of function 'iounmap' [-Werror=implicit-function-declaration]
  iounmap(dsaf_dev->io_base);
  ^
   cc1: some warnings being treated as errors

vim +/iounmap +151 drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c

511e6bc0 huangdaode 2015-09-17  135 dsaf_dev->buf_size_type = 
hns_rcb_buf_size2type(buf_size);
511e6bc0 huangdaode 2015-09-17  136 if (dsaf_dev->buf_size_type < 0) {
511e6bc0 huangdaode 2015-09-17  137 dev_err(dsaf_dev->dev,
511e6bc0 huangdaode 2015-09-17  138 "buf_size(%d) is 
wrong!\n", buf_size);
511e6bc0 huangdaode 2015-09-17  139 goto unmap_base_addr;
511e6bc0 huangdaode 2015-09-17  140 }
511e6bc0 huangdaode 2015-09-17  141  
511e6bc0 huangdaode 2015-09-17  142 if 
(!dma_set_mask_and_coherent(dsaf_dev->dev, DMA_BIT_MASK(64ULL)))
511e6bc0 huangdaode 2015-09-17  143 dev_dbg(dsaf_dev->dev, "set 
mask to 64bit\n");
511e6bc0 huangdaode 2015-09-17  144 else
511e6bc0 huangdaode 2015-09-17  145 dev_err(dsaf_dev->dev, "set 
mask to 64bit fail!\n");
511e6bc0 huangdaode 2015-09-17  146  
511e6bc0 huangdaode 2015-09-17  147 return 0;
511e6bc0 huangdaode 2015-09-17  148  
511e6bc0 huangdaode 2015-09-17  149  unmap_base_addr:
511e6bc0 huangdaode 2015-09-17  150 if (dsaf_dev->io_base)
511e6bc0 huangdaode 2015-09-17 @151 iounmap(dsaf_dev->io_base);
511e6bc0 huangdaode 2015-09-17  152 if (dsaf_dev->ppe_base)
511e6bc0 huangdaode 2015-09-17  153 iounmap(dsaf_dev->ppe_base);
511e6bc0 huangdaode 2015-09-17  154 if (dsaf_dev->sds_base)
511e6bc0 huangdaode 2015-09-17  155 iounmap(dsaf_dev->sds_base);
511e6bc0 huangdaode 2015-09-17  156 if (dsaf_dev->sc_base)
511e6bc0 huangdaode 2015-09-17  157 iounmap(dsaf_dev->sc_base);
511e6bc0 huangdaode 2015-09-17  158 if (dsaf_dev->cpld_base)
511e6bc0 huangdaode 2015-09-17  159 iounmap(dsaf_dev->cpld_base);

:: The code at line 151 was first introduced by commit
:: 511e6bc071db1484d1a3d1d0bd4c244cf33910ff net: add Hisilicon Network 
Subsystem DSAF support

:: TO: huangdaode 
:: CC: David S. Miller 

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: Binary data


Re: SoCFPGA ethernet broken

2015-10-16 Thread Florian Fainelli
On 16/10/15 08:56, Andrew Lunn wrote:
>> So I think I'll move to inspect what Florian had suggested, and that was to 
>> look
>> at: drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.c::stmmac_mdio_register 
> 
> I have a suspicion. If you look at the phy driver it does:
> 
> static int ksz9021_config_init(struct phy_device *phydev)
> {
> const struct device *dev = &phydev->dev;
> const struct device_node *of_node = dev->of_node;
> 
> if (!of_node && dev->parent->of_node)
> of_node = dev->parent->of_node;
> 
> 
> In your case, you don't have a phy node in your device tree, so of_node
> is NULL. So it looks in the parent device.
> 
> phylib: Make PHYs children of their MDIO bus, not the bus' parent.
> 
> changed what the parent is. It is now the mdio device. Before, i
> suspect it was the MAC. Hence it found your properties in the MAC
> node.
> 
> What i think you might want to do is change this code. Rather than
> look a dev->parent->of_node; you might want
> phydev->attached_dev->dev->of_node.

Yes, that would work, by the time config_init() executes, you are
guaranteed to have a valid attached_dev pointer.

> 
> This assumes the phy has been attached to the MAC. I've no idea of the
> ordering, so maybe it has not been attached yet?
> 
> dp83867.c has similar code. However quick grep did not find any
> mainline users with properties in the MAC node. If that is true, i
> would suggest removing the code looking in the parent for that phy
> driver.

I think we should fix these drivers to look for these properties where
they were used to, as a short term plan, and then make sure that their
binding document is very specific about the fact that these properties,
are by definition, properties of the Ethernet PHY, not the Ethernet MAC.
-- 
Florian
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCHv2 net] openvswitch: Scrub skb between namespaces

2015-10-16 Thread Joe Stringer
If OVS receives a packet from another namespace, then the packet should
be scrubbed. However, people have already begun to rely on the behaviour
that skb->mark is preserved across namespaces, so retain this one field.

This is mainly to address information leakage between namespaces when
using OVS internal ports, but by placing it in ovs_vport_receive() it is
more generally applicable, meaning it should not be overlooked if other
port types are allowed to be moved into namespaces in future.

Signed-off-by: Joe Stringer 
---
v2: Add unlikely(), shift all within the netns check block.
---
 net/openvswitch/vport.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/net/openvswitch/vport.c b/net/openvswitch/vport.c
index fc5c0b9ccfe9..12a36ac21eda 100644
--- a/net/openvswitch/vport.c
+++ b/net/openvswitch/vport.c
@@ -444,6 +444,15 @@ int ovs_vport_receive(struct vport *vport, struct sk_buff 
*skb,
 
OVS_CB(skb)->input_vport = vport;
OVS_CB(skb)->mru = 0;
+   if (unlikely(dev_net(skb->dev) != ovs_dp_get_net(vport->dp))) {
+   u32 mark;
+
+   mark = skb->mark;
+   skb_scrub_packet(skb, true);
+   skb->mark = mark;
+   tun_info = NULL;
+   }
+
/* Extract flow from 'skb' into 'key'. */
error = ovs_flow_key_extract(tun_info, skb, &key);
if (unlikely(error)) {
-- 
2.1.4

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCHv2 net 2/3] openvswitch: Treat IP_CT_RELATED as new

2015-10-16 Thread Joe Stringer
New, related connections are marked as such as part of ovs_ct_lookup(),
but they are not marked as "new" if the commit flag is used. Make this
consistent by treating IP_CT_RELATED as new as well.

Reported-by: Jarno Rajahalme 
Signed-off-by: Joe Stringer 
Acked-by: Pravin B Shelar 
---
v2: Acked.
---
 net/openvswitch/conntrack.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
index 80bf702715bb..480dbb9095b7 100644
--- a/net/openvswitch/conntrack.c
+++ b/net/openvswitch/conntrack.c
@@ -86,6 +86,8 @@ static u8 ovs_ct_get_state(enum ip_conntrack_info ctinfo)
ct_state |= OVS_CS_F_ESTABLISHED;
break;
case IP_CT_RELATED:
+   ct_state |= OVS_CS_F_NEW;
+   /* Fall through */
case IP_CT_RELATED_REPLY:
ct_state |= OVS_CS_F_RELATED;
break;
-- 
2.1.4

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCHv2 net 1/3] openvswitch: Reject ct_state masks for unknown bits

2015-10-16 Thread Joe Stringer
Currently, 0-bits are generated in ct_state where the bit position is
undefined, and matches are accepted on these bit-positions. If userspace
requests to match the 0-value for this bit then it may expect only a
subset of traffic to match this value, whereas currently all packets
will have this bit set to 0. Fix this by rejecting such masks.

Signed-off-by: Joe Stringer 
Acked-by: Pravin B Shelar 
---
v2: Remove extraneous ovs_ct_supported() function declaration.
Acked.
---
 net/openvswitch/conntrack.h| 16 +---
 net/openvswitch/flow_netlink.c |  5 -
 2 files changed, 9 insertions(+), 12 deletions(-)

diff --git a/net/openvswitch/conntrack.h b/net/openvswitch/conntrack.h
index da8714942c95..82e0dfc66028 100644
--- a/net/openvswitch/conntrack.h
+++ b/net/openvswitch/conntrack.h
@@ -35,12 +35,9 @@ void ovs_ct_fill_key(const struct sk_buff *skb, struct 
sw_flow_key *key);
 int ovs_ct_put_key(const struct sw_flow_key *key, struct sk_buff *skb);
 void ovs_ct_free_action(const struct nlattr *a);
 
-static inline bool ovs_ct_state_supported(u32 state)
-{
-   return !(state & ~(OVS_CS_F_NEW | OVS_CS_F_ESTABLISHED |
-OVS_CS_F_RELATED | OVS_CS_F_REPLY_DIR |
-OVS_CS_F_INVALID | OVS_CS_F_TRACKED));
-}
+#define CT_SUPPORTED_MASK (OVS_CS_F_NEW | OVS_CS_F_ESTABLISHED | \
+  OVS_CS_F_RELATED | OVS_CS_F_REPLY_DIR | \
+  OVS_CS_F_INVALID | OVS_CS_F_TRACKED)
 #else
 #include 
 
@@ -53,11 +50,6 @@ static inline bool ovs_ct_verify(struct net *net, int attr)
return false;
 }
 
-static inline bool ovs_ct_state_supported(u32 state)
-{
-   return false;
-}
-
 static inline int ovs_ct_copy_action(struct net *net, const struct nlattr *nla,
 const struct sw_flow_key *key,
 struct sw_flow_actions **acts, bool log)
@@ -94,5 +86,7 @@ static inline int ovs_ct_put_key(const struct sw_flow_key 
*key,
 }
 
 static inline void ovs_ct_free_action(const struct nlattr *a) { }
+
+#define CT_SUPPORTED_MASK 0
 #endif /* CONFIG_NF_CONNTRACK */
 #endif /* ovs_conntrack.h */
diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index 171a691f1c32..bd710bc37469 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -816,7 +816,7 @@ static int metadata_from_nlattrs(struct net *net, struct 
sw_flow_match *match,
ovs_ct_verify(net, OVS_KEY_ATTR_CT_STATE)) {
u32 ct_state = nla_get_u32(a[OVS_KEY_ATTR_CT_STATE]);
 
-   if (!is_mask && !ovs_ct_state_supported(ct_state)) {
+   if (ct_state & ~CT_SUPPORTED_MASK) {
OVS_NLERR(log, "ct_state flags %08x unsupported",
  ct_state);
return -EINVAL;
@@ -1099,6 +1099,9 @@ static void nlattr_set(struct nlattr *attr, u8 val,
} else {
memset(nla_data(nla), val, nla_len(nla));
}
+
+   if (nla_type(nla) == OVS_KEY_ATTR_CT_STATE)
+   *(u32 *)nla_data(nla) &= CT_SUPPORTED_MASK;
}
 }
 
-- 
2.1.4

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCHv2 net 3/3] openvswitch: Serialize nested ct actions if provided

2015-10-16 Thread Joe Stringer
If userspace provides a ct action with no nested mark or label, then the
storage for these fields is zeroed. Later when actions are requested,
such zeroed fields are serialized even though userspace didn't
originally specify them. Fix the behaviour by ensuring that no action is
serialized in this case, and reject actions where userspace attempts to
set these fields with mask=0. This should make netlink marshalling
consistent across deserialization/reserialization.

Reported-by: Jarno Rajahalme 
Signed-off-by: Joe Stringer 
---
v2: Reuse labels_nonzero().
Don't check for labels support during execution.
---
 net/openvswitch/conntrack.c | 14 ++
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
index 480dbb9095b7..b0287f13491a 100644
--- a/net/openvswitch/conntrack.c
+++ b/net/openvswitch/conntrack.c
@@ -224,9 +224,6 @@ static int ovs_ct_set_labels(struct sk_buff *skb, struct 
sw_flow_key *key,
struct nf_conn *ct;
int err;
 
-   if (!IS_ENABLED(CONFIG_NF_CONNTRACK_LABELS))
-   return -ENOTSUPP;
-
/* The connection could be invalid, in which case set_label is no-op.*/
ct = nf_ct_get(skb, &ctinfo);
if (!ct)
@@ -589,6 +586,10 @@ static int parse_ct(const struct nlattr *attr, struct 
ovs_conntrack_info *info,
case OVS_CT_ATTR_MARK: {
struct md_mark *mark = nla_data(a);
 
+   if (!mark->mask) {
+   OVS_NLERR(log, "ct_mark mask cannot be 0");
+   return -EINVAL;
+   }
info->mark = *mark;
break;
}
@@ -597,6 +598,10 @@ static int parse_ct(const struct nlattr *attr, struct 
ovs_conntrack_info *info,
case OVS_CT_ATTR_LABELS: {
struct md_labels *labels = nla_data(a);
 
+   if (!labels_nonzero(&labels->mask)) {
+   OVS_NLERR(log, "ct_labels mask cannot be 0");
+   return -EINVAL;
+   }
info->labels = *labels;
break;
}
@@ -707,11 +712,12 @@ int ovs_ct_action_to_attr(const struct ovs_conntrack_info 
*ct_info,
if (IS_ENABLED(CONFIG_NF_CONNTRACK_ZONES) &&
nla_put_u16(skb, OVS_CT_ATTR_ZONE, ct_info->zone.id))
return -EMSGSIZE;
-   if (IS_ENABLED(CONFIG_NF_CONNTRACK_MARK) &&
+   if (IS_ENABLED(CONFIG_NF_CONNTRACK_MARK) && ct_info->mark.mask &&
nla_put(skb, OVS_CT_ATTR_MARK, sizeof(ct_info->mark),
&ct_info->mark))
return -EMSGSIZE;
if (IS_ENABLED(CONFIG_NF_CONNTRACK_LABELS) &&
+   labels_nonzero(&ct_info->labels.mask) &&
nla_put(skb, OVS_CT_ATTR_LABELS, sizeof(ct_info->labels),
&ct_info->labels))
return -EMSGSIZE;
-- 
2.1.4

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] net: hix5hd2_gmac: avoid integer overload warning

2015-10-16 Thread Joe Perches
On Fri, 2015-10-16 at 13:28 +0200, Arnd Bergmann wrote:
> On Friday 16 October 2015 11:14:44 David Laight wrote:
> > From: Arnd Bergmann
> > > Sent: 16 October 2015 11:01
> > > BITS_RX_EN is an 'unsigned long' constant, so the ones complement of that
> > > has bits set that do not fit into a 32-bit variable on 64-bit 
> > > architectures,
> > > which causes a harmless gcc warning:
> > ...
> > >  static void hix5hd2_port_disable(struct hix5hd2_priv *priv)
> > >  {
> > > - writel_relaxed(~(BITS_RX_EN | BITS_TX_EN), priv->base + PORT_EN);
> > > + writel_relaxed(~(u32)(BITS_RX_EN | BITS_TX_EN), priv->base + 
> > > PORT_EN);
> > >   writel_relaxed(0, priv->base + DESC_WR_RD_ENA);
> > 
> > ISTM that just means that the constants shouldn't be 'long'.
> 
> Right, but that would probably mean changing the BIT() macro or not using it
> here. In the past I've argued against using that macro, but I've given
> up that fight.

Fight on... (Somebody must have gone to USC here)

There might be value in a BIT_U32 macro.
Maybe BIT_U64 too.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs

2015-10-16 Thread Daniel Borkmann

On 10/16/2015 07:42 PM, Alexei Starovoitov wrote:

On 10/16/15 10:21 AM, Hannes Frederic Sowa wrote:

Another question:
Should multiple mount of the filesystem result in an empty fs (a new
instance) or in one were one can see other ebpf-fs entities? I think
Daniel wanted to already use the mountpoint as some kind of hierarchy
delimiter. I would have used directories for that and multiple mounts
would then have resulted in the same content of the filesystem. IMHO
this would remove some ambiguity but then the question arises how this
is handled in a namespaced environment. Was there some specific reason
to do so?


That's an interesting question!
I think all mounts should be independent.
I can see tracing using one and networking using another one
with different hierarchies suitable for their own use cases.
What's an advantage to have the same content everywhere?
Feels harder to manage, since different users would need to
coordinate.


I initially had it as a mount_single() file system, where I was thinking
to have an entry under /sys/fs/bpf/, so all subsystems would work on top
of that mount point, but for the same reasons above I lifted that restriction.

Cheers,
Daniel
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs

2015-10-16 Thread Alexei Starovoitov

On 10/16/15 10:21 AM, Hannes Frederic Sowa wrote:

Another question:
Should multiple mount of the filesystem result in an empty fs (a new
instance) or in one were one can see other ebpf-fs entities? I think
Daniel wanted to already use the mountpoint as some kind of hierarchy
delimiter. I would have used directories for that and multiple mounts
would then have resulted in the same content of the filesystem. IMHO
this would remove some ambiguity but then the question arises how this
is handled in a namespaced environment. Was there some specific reason
to do so?


That's an interesting question!
I think all mounts should be independent.
I can see tracing using one and networking using another one
with different hierarchies suitable for their own use cases.
What's an advantage to have the same content everywhere?
Feels harder to manage, since different users would need to
coordinate.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs

2015-10-16 Thread Thomas Graf
On 10/16/15 at 10:32am, Alexei Starovoitov wrote:
> On 10/16/15 9:43 AM, Hannes Frederic Sowa wrote:
> >Oh, tracing does not allow daemons. Why? I can only imagine embedded
> >users, no?
> 
> yes and for networking: restartability and HA.
> cannot really do that with fuse/daemons.

Right, the smaller the footprint, the better.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs

2015-10-16 Thread Alexei Starovoitov

On 10/16/15 10:27 AM, Daniel Borkmann wrote:

but don't know how flexible we are in
terms of adding S_IFBPF to the UAPI.


I don't think it should be a problem. You referred to POSIX Standard in
your other mail but I can't see any reason why not to establish a new
file mode. Anyway, FreeBSD (e.g. whiteouts) and Solaris (e.g. Doors,
Event Ports) are just examples of new modes being added.

mknod /bpf/map/1 m 1 1

:)

Yes, maybe I think this is a better solution architectural instead of
constructing a new filesystem.


Yeah, also 'man 2 stat' lists a couple of others used by various systems.

The pro's of this approach would be that no new file system would be needed
and the special inode could be placed on top of any 'regular' file system
that would support special files. I do like that as well.


I don't like it at all for the reasons you've just stated:
'it will prevent us doing shell style access to such files'


I'm wondering whether this would prevent us in future from opening access
to shell tools etc on that special file, but probably one could provide a
default set of file ops via init_special_inode() that could be overloaded
by the underlying fs if required.


and also because adding new S_ISSOCK-like bit for bpf feel as a begining
of nightmare, since sooner or later all filesystems would need to have
a check for it like they have for sock type.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs

2015-10-16 Thread Alexei Starovoitov

On 10/16/15 9:43 AM, Hannes Frederic Sowa wrote:

Hi Alexei,

On Fri, Oct 16, 2015, at 18:18, Alexei Starovoitov wrote:

On 10/16/15 3:25 AM, Hannes Frederic Sowa wrote:

Namespaces at some point dealt with the same problem, they nowadays use
bind mounts of/proc/$$/ns/* to some place in the file hierarchy to keep
the namespace alive. This at least allows someone to build up its own
hierarchy with normal unix tools and not hidden inside a C-program. For
filedescriptors we already have/proc/$$/fd/* but it seems that doesn't
work out of the box nowadays.


bind mounting of /proc/../fd was initially proposed by Andy and we've
looked at it thoroughly, but after discussion with Eric it became
apparent that it doesn't fit here. At the end we need shell tools
to access maps.


Oh yes, I want shell tools for this very much! Maybe even that things
like strings, grep etc. work. :)


yes and the only way to get there is to have it done via fs.


Also I think you missed the hierarchy in this patch set _is_ built with
normal 'mkdir' and files are removed with 'rm'.


I did not miss that, I am just concerned that if the kernel does not
enforce such a hierarchy automatically it won't really happen.


if it's easier for user to work with single level of directories,
it should be able to do so. It's not a job of the kernel to enforce
how user space apps should be designed.


Oh, tracing does not allow daemons. Why? I can only imagine embedded
users, no?


yes and for networking: restartability and HA.
cannot really do that with fuse/daemons.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs

2015-10-16 Thread Daniel Borkmann

On 10/16/2015 06:36 PM, Hannes Frederic Sowa wrote:

On Fri, Oct 16, 2015, at 15:36, Daniel Borkmann wrote:

On 10/16/2015 12:25 PM, Hannes Frederic Sowa wrote:

On Fri, Oct 16, 2015, at 03:09, Daniel Borkmann wrote:

This eventually leads us to this patch, which implements a minimal
eBPF file system. The idea is a bit similar, but to the point that
these inodes reside at one or multiple mount points. A directory
hierarchy can be tailored to a specific application use-case from the
various subsystem users and maps/progs pinned inside it. Two new eBPF
commands (BPF_PIN_FD, BPF_NEW_FD) have been added to the syscall in
order to create one or multiple special inodes from an existing file
descriptor that points to a map/program (we call it eBPF fd pinning),
or to create a new file descriptor from an existing special inode.
BPF_PIN_FD requires CAP_SYS_ADMIN capabilities, whereas BPF_NEW_FD
can also be done unpriviledged when having appropriate permissions
to the path.


In my opinion this is very un-unixiy, I have to say at least.

Namespaces at some point dealt with the same problem, they nowadays use
bind mounts of /proc/$$/ns/* to some place in the file hierarchy to keep
the namespace alive. This at least allows someone to build up its own
hierarchy with normal unix tools and not hidden inside a C-program. For
filedescriptors we already have /proc/$$/fd/* but it seems that doesn't
work out of the box nowadays.


Yes, that doesn't work out of the box, but I also don't know how usable
that would really be. The idea is roughly rather similar to the paths
passed to bind(2)/connect(2) on Unix domain sockets, as mentioned. You
have a map/prog resource that you stick to a special inode so that you
can retrieve it at a later point in time from the same or different
processes through a new fd pointing to the resource from user side, so
that the bpf(2) syscall can be performed upon it.

With Unix tools, you could still create/remove a hierarchy or unlink
those that have maps/progs. You are correct that tools that don't
implement bpf(2) currently cannot access the content behind it, since
bpf(2) manages access to the data itself. I did like the 2nd idea though,
mentioned in the commit log, but don't know how flexible we are in
terms of adding S_IFBPF to the UAPI.


I don't think it should be a problem. You referred to POSIX Standard in
your other mail but I can't see any reason why not to establish a new
file mode. Anyway, FreeBSD (e.g. whiteouts) and Solaris (e.g. Doors,
Event Ports) are just examples of new modes being added.

mknod /bpf/map/1 m 1 1

:)

Yes, maybe I think this is a better solution architectural instead of
constructing a new filesystem.


Yeah, also 'man 2 stat' lists a couple of others used by various systems.

The pro's of this approach would be that no new file system would be needed
and the special inode could be placed on top of any 'regular' file system
that would support special files. I do like that as well.

I'm wondering whether this would prevent us in future from opening access
to shell tools etc on that special file, but probably one could provide a
default set of file ops via init_special_inode() that could be overloaded
by the underlying fs if required.


I don't know in terms of how many objects bpf should be able to handle
and if such a bind-mount based solution would work, I guess not.

In my opinion I still favor a user space approach. Subsystems which use
ebpf in a way that no user space program needs to be running to control
them would need to export the fds by itself. E.g. something like
sysfs/kobject for tc? The hierarchy would then be in control of the
subsystem which could also create a proper naming hierarchy or maybe
even use an already given one. Do most other eBPF users really need to
persist file descriptors somewhere without user space control and pick
them up later?


I was thinking about a strict predefined hierarchy dictated by the kernel
as well, but was then considering a more flexible approach that could be
tailored freely to various use cases. A predefined hierarchy would most
likely need to be resolved per subsystem and it's not really easy to map
this properly. F.e. if the kernel would try to provide unique ids (as
opposed to have a name or annotation member through the syscall), it
could end up being quite cryptic. If we let the users choose names, I'm
not sure if a single hierarchy level would be enough. Then, additionally
you have facilities like tail calls that eBPF programs could do.


I don't think that most subsystems need to expose those file
descriptors. Seccomp probably will have a supervisor process running and
per aggregation will also have a user space process running keeping the
fd alive. So it is all about tc/sched.

And I am not sure if tc will really needs a filesystem to handle all
this. The simplest approach is to just keep a name <-> fd mapping
somewhere in the net/sched/ subsystem and use this for all tc users.


Solving this on a generic 

Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs

2015-10-16 Thread Hannes Frederic Sowa
On Fri, Oct 16, 2015, at 03:09, Daniel Borkmann wrote:
> This eventually leads us to this patch, which implements a minimal
> eBPF file system. The idea is a bit similar, but to the point that
> these inodes reside at one or multiple mount points. A directory
> hierarchy can be tailored to a specific application use-case from the
> various subsystem users and maps/progs pinned inside it. Two new eBPF
> commands (BPF_PIN_FD, BPF_NEW_FD) have been added to the syscall in
> order to create one or multiple special inodes from an existing file
> descriptor that points to a map/program (we call it eBPF fd pinning),
> or to create a new file descriptor from an existing special inode.
> BPF_PIN_FD requires CAP_SYS_ADMIN capabilities, whereas BPF_NEW_FD
> can also be done unpriviledged when having appropriate permissions
> to the path.
> 

Another question:
Should multiple mount of the filesystem result in an empty fs (a new
instance) or in one were one can see other ebpf-fs entities? I think
Daniel wanted to already use the mountpoint as some kind of hierarchy
delimiter. I would have used directories for that and multiple mounts
would then have resulted in the same content of the filesystem. IMHO
this would remove some ambiguity but then the question arises how this
is handled in a namespaced environment. Was there some specific reason
to do so?

Thanks,
Hannes
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch net-next v5 3/8] switchdev: allow caller to explicitly request attr_set as deferred

2015-10-16 Thread John Fastabend
On 15-10-16 09:20 AM, Scott Feldman wrote:
> On Fri, Oct 16, 2015 at 1:23 AM, Jiri Pirko  wrote:
>> Thu, Oct 15, 2015 at 05:21:22PM CEST, john.fastab...@gmail.com wrote:
>>> On 15-10-14 10:40 AM, Jiri Pirko wrote:
 From: Jiri Pirko 

 Caller should know if he can call attr_set directly (when holding RTNL)
 or if he has to defer the att_set processing for later.

 This also allows drivers to sleep inside attr_set and report operation
 status back to switchdev core. Switchdev core then warns if status is
 not ok, instead of silent errors happening in drivers.

 Benefit from newly introduced switchdev deferred ops infrastructure.

>>>
>>> A nit but the patch description should note your setting the defer bit
>>> on the bridge set state.
>>>
 Signed-off-by: Jiri Pirko 
 ---
  include/net/switchdev.h   |   1 +
  net/bridge/br_stp.c   |   3 +-
  net/switchdev/switchdev.c | 108 
 ++
  3 files changed, 46 insertions(+), 66 deletions(-)

 diff --git a/include/net/switchdev.h b/include/net/switchdev.h
 index d1c7f90..f7de6f8 100644
 --- a/include/net/switchdev.h
 +++ b/include/net/switchdev.h
 @@ -17,6 +17,7 @@

  #define SWITCHDEV_F_NO_RECURSE  BIT(0)
  #define SWITCHDEV_F_SKIP_EOPNOTSUPP BIT(1)
 +#define SWITCHDEV_F_DEFER   BIT(2)

  struct switchdev_trans_item {
  struct list_head list;
 diff --git a/net/bridge/br_stp.c b/net/bridge/br_stp.c
 index db6d243de..80c34d7 100644
 --- a/net/bridge/br_stp.c
 +++ b/net/bridge/br_stp.c
 @@ -41,13 +41,14 @@ void br_set_state(struct net_bridge_port *p, unsigned 
 int state)
  {
  struct switchdev_attr attr = {
  .id = SWITCHDEV_ATTR_ID_PORT_STP_STATE,
 +.flags = SWITCHDEV_F_DEFER,
  .u.stp_state = state,
  };
>>>
>>>
>>> This creates a possible race (with 6/8) I think, please check!
>>
>> Wait. This patch does not change the previous behaviour. Patch 6 does,
>> so I don't understand why you are asking here. Confusing.
>>
>>
>>>
>>> In del_nbp() we call br_stp_disable_port() to set the port state
>>> to BR_STATE_DISABLE and disabling learning events. But with this
>>> patch it can be deferred. Also note the STP agent may be in userspace
>>> which actually seems more likely the case because you likely want to
>>> run some more modern variant of STP than the kernel supports.
>>>
>>> So at some point in the future the driver will turn off learning. At
>>> the same time we call br_fdb_delete_by_port which calls a deferred
>>> set of fdb deletes.
>>>
>>> I don't see how you guarantee learning is off before you start doing
>>> the deletes here and possibly learning new addresses after the software
>>> side believes the port is down.
>>>
>>> So
>>>
>>>   br_stp_disable_port
>>>   br_fdb_delete_by_port
>>>   {fdb_del_external_learn}
>>>   [hw learns a fdb]
>>>   [hw disables learning]
>>>
>>> What stops this from happening?
>>
>> Okay. This behaviour is the same as without the patchset. What would
>> resolve the issue it to put switchdev_deferred_process() after
>> br_stp_disable_port() and before br_fdb_delete_by_port() call.
>> That would enforce stp change to happen in hw before fdbs are explicitly
>> deleted. Sound good to you?
> 
> Doesn't HW already see things in the right order since items are
> dequeued from the deferred list in the order queued?
> 

I just noticed there is the rtnl lock around the deferred process so
this should mean only a single deferred task is run at a time. OK sort
of a big hammer though.

+static void switchdev_deferred_process_work(struct work_struct *work)
+{
+   rtnl_lock();
+   switchdev_deferred_process();
+   rtnl_unlock();
+}

But I don't see how this block of code works,

spin_lock_bh(&br->lock);
br_stp_disable_port(p);
spin_unlock_bh(&br->lock);

br_ifinfo_notify(RTM_DELLINK, p);

list_del_rcu(&p->list);

nbp_vlan_flush(p);
br_fdb_delete_by_port(br, p, 0, 1);
switchdev_deferred_process();


If you defer disabling learning then setup a bunch of deletes also
deferred, what guarantees that learning was disabled before you setup
the deferred fdb delete list?

It seems you may call rocker_port_fdb_learn_work() after the
br_fdb_delete_by_port call above but before the deferred learn disable
call which would have disabled it. Also the rocker_port_fdb_learn_work
queue is on a lw->work and doesn't use the same locking?

Its sort of an interesting hieararchy of work queues being built up
here. I'm being a bit overly paranoid at this point because I've seen
some really ugly bugs around these types of things that are difficult
to spot because they are correctness bugs instead of hard crashes.

.John


--
To unsubscribe from this list: send the line "

Re: [PATCH nf-next 1/6] netfilter-ipv4: Line layout whitespace fixes

2015-10-16 Thread Pablo Neira Ayuso
On Wed, Oct 14, 2015 at 11:17:03PM +0100, Ian Morris wrote:
> Cleanses some whitespace issues by removing a leading space before a tab.
> 
> No changes detected by objdiff.
> 
> Signed-off-by: Ian Morris 
> ---
>  net/ipv4/netfilter/ipt_ECN.c   | 2 +-
>  net/ipv4/netfilter/nf_nat_pptp.c   | 2 +-
>  net/ipv4/netfilter/nf_nat_snmp_basic.c | 2 +-
>  3 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/net/ipv4/netfilter/ipt_ECN.c b/net/ipv4/netfilter/ipt_ECN.c
> index 2707652..6592708 100644
> --- a/net/ipv4/netfilter/ipt_ECN.c
> +++ b/net/ipv4/netfilter/ipt_ECN.c
> @@ -24,7 +24,7 @@ MODULE_AUTHOR("Harald Welte ");
>  MODULE_DESCRIPTION("Xtables: Explicit Congestion Notification (ECN) flag 
> modification");
>  
>  /* set ECT codepoint from IP header.
> - *   return false if there was an error. */
> + *   return false if there was an error. */

In netdev coding style we prefer for multiline comments:

/* This is a comment blah blah blah blah blah blah blah blah blah blah blah blah
 * blah blah blah.
 */

For single line:

/* This is a comment blah blah blah */

This case, I suggest we can even get rid of that comment there since
it's obvious what set_ect_ip() is doing by reading the function name
and looking at what it returns.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH nf-next 0/6] coding style improvements: netfilter-ipv4

2015-10-16 Thread Pablo Neira Ayuso
On Wed, Oct 14, 2015 at 11:17:02PM +0100, Ian Morris wrote:
> This series of patches improves the coding style of the netfilter-ipv4 
> code by addressing some issues detected by checkpatch.
> 
> The changes were previously submitted as part of a larger monolithic 
> patch but on advice from Pablo, these are being re-sent in smaller, 
> more structured batches.

Series applied, except patch 1/6 that I have kept back. I'll follow up
with an explanation.

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] net: hisilicon: add OF dependency

2015-10-16 Thread kbuild test robot
Hi Arnd,

[auto build test ERROR on net-next/master -- if it's inappropriate base, please 
suggest rules for selecting the more suitable base]

url:
https://github.com/0day-ci/linux/commits/Arnd-Bergmann/net-hisilicon-add-OF-dependency/20151016-173818
config: x86_64-allmodconfig
reproduce:
make ARCH=x86_64  allmodconfig
make ARCH=x86_64 

All error/warnings (new ones prefixed by >>):


vim +/vzalloc +1123 drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c

511e6bc0 huangdaode 2015-09-17  1117  
511e6bc0 huangdaode 2015-09-17  1118ret = hns_dsaf_init_hw(dsaf_dev);
511e6bc0 huangdaode 2015-09-17  1119if (ret)
511e6bc0 huangdaode 2015-09-17  1120return ret;
511e6bc0 huangdaode 2015-09-17  1121  
511e6bc0 huangdaode 2015-09-17  1122/* malloc mem for tcam mac 
key(vlan+mac) */
511e6bc0 huangdaode 2015-09-17 @1123priv->soft_mac_tbl = 
vzalloc(sizeof(*priv->soft_mac_tbl)
511e6bc0 huangdaode 2015-09-17  1124  * DSAF_TCAM_SUM);
511e6bc0 huangdaode 2015-09-17  1125if (!priv->soft_mac_tbl) {
511e6bc0 huangdaode 2015-09-17  1126ret = -ENOMEM;
511e6bc0 huangdaode 2015-09-17  1127goto remove_hw;
511e6bc0 huangdaode 2015-09-17  1128}
511e6bc0 huangdaode 2015-09-17  1129  
511e6bc0 huangdaode 2015-09-17  1130/*all entry invall */
511e6bc0 huangdaode 2015-09-17  1131for (i = 0; i < DSAF_TCAM_SUM; i++)
511e6bc0 huangdaode 2015-09-17  1132(priv->soft_mac_tbl + i)->index 
= DSAF_INVALID_ENTRY_IDX;
511e6bc0 huangdaode 2015-09-17  1133  
511e6bc0 huangdaode 2015-09-17  1134return 0;
511e6bc0 huangdaode 2015-09-17  1135  
511e6bc0 huangdaode 2015-09-17  1136  remove_hw:
511e6bc0 huangdaode 2015-09-17  1137hns_dsaf_remove_hw(dsaf_dev);
511e6bc0 huangdaode 2015-09-17  1138return ret;
511e6bc0 huangdaode 2015-09-17  1139  }
511e6bc0 huangdaode 2015-09-17  1140  
511e6bc0 huangdaode 2015-09-17  1141  /**
511e6bc0 huangdaode 2015-09-17  1142   * hns_dsaf_free - free dsa fabric
511e6bc0 huangdaode 2015-09-17  1143   * @dsaf_dev: dsa fabric device struct 
pointer
511e6bc0 huangdaode 2015-09-17  1144   */
511e6bc0 huangdaode 2015-09-17  1145  static void hns_dsaf_free(struct 
dsaf_device *dsaf_dev)
511e6bc0 huangdaode 2015-09-17  1146  {
511e6bc0 huangdaode 2015-09-17  1147struct dsaf_drv_priv *priv =
511e6bc0 huangdaode 2015-09-17  1148(struct dsaf_drv_priv 
*)hns_dsaf_dev_priv(dsaf_dev);
511e6bc0 huangdaode 2015-09-17  1149  
511e6bc0 huangdaode 2015-09-17  1150hns_dsaf_remove_hw(dsaf_dev);
511e6bc0 huangdaode 2015-09-17  1151  
511e6bc0 huangdaode 2015-09-17  1152/* free all mac mem */
511e6bc0 huangdaode 2015-09-17 @1153vfree(priv->soft_mac_tbl);
511e6bc0 huangdaode 2015-09-17  1154priv->soft_mac_tbl = NULL;
511e6bc0 huangdaode 2015-09-17  1155  }
511e6bc0 huangdaode 2015-09-17  1156  

:: The code at line 1123 was first introduced by commit
:: 511e6bc071db1484d1a3d1d0bd4c244cf33910ff net: add Hisilicon Network 
Subsystem DSAF support

:: TO: huangdaode 
:: CC: David S. Miller 

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: SoCFPGA ethernet broken

2015-10-16 Thread David Daney

On 10/16/2015 08:56 AM, Andrew Lunn wrote:

So I think I'll move to inspect what Florian had suggested, and that was to look
at: drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.c::stmmac_mdio_register


I have a suspicion. If you look at the phy driver it does:

static int ksz9021_config_init(struct phy_device *phydev)
{
 const struct device *dev = &phydev->dev;
 const struct device_node *of_node = dev->of_node;

 if (!of_node && dev->parent->of_node)
 of_node = dev->parent->of_node;



Maybe we need to walk up the hierarchy.

Perhaps something like:

const struct device *dev_walker;

dev_walker = &phydev->dev;
do {
   of_node = dev_walker->of_node;
   dev_walker = dev_walker->parent;
} while (!of_node && dev_walker);

An alternative would be to assign the bus the same of_node as the bus 
parent.


If either approach works, you can add:
 Acked-by: David Daney 

to the patch that implements it.



In your case, you don't have a phy node in your device tree, so of_node
is NULL. So it looks in the parent device.

 phylib: Make PHYs children of their MDIO bus, not the bus' parent.

changed what the parent is. It is now the mdio device. Before, i
suspect it was the MAC. Hence it found your properties in the MAC
node.

What i think you might want to do is change this code. Rather than
look a dev->parent->of_node; you might want
phydev->attached_dev->dev->of_node.

This assumes the phy has been attached to the MAC. I've no idea of the
ordering, so maybe it has not been attached yet?

dp83867.c has similar code. However quick grep did not find any
mainline users with properties in the MAC node. If that is true, i
would suggest removing the code looking in the parent for that phy
driver.

Andrew



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs

2015-10-16 Thread Hannes Frederic Sowa
Hi Alexei,

On Fri, Oct 16, 2015, at 18:18, Alexei Starovoitov wrote:
> On 10/16/15 3:25 AM, Hannes Frederic Sowa wrote:
> > Namespaces at some point dealt with the same problem, they nowadays use
> > bind mounts of/proc/$$/ns/* to some place in the file hierarchy to keep
> > the namespace alive. This at least allows someone to build up its own
> > hierarchy with normal unix tools and not hidden inside a C-program. For
> > filedescriptors we already have/proc/$$/fd/* but it seems that doesn't
> > work out of the box nowadays.
> 
> bind mounting of /proc/../fd was initially proposed by Andy and we've
> looked at it thoroughly, but after discussion with Eric it became
> apparent that it doesn't fit here. At the end we need shell tools
> to access maps.

Oh yes, I want shell tools for this very much! Maybe even that things
like strings, grep etc. work. :)

> Also I think you missed the hierarchy in this patch set _is_ built with
> normal 'mkdir' and files are removed with 'rm'.

I did not miss that, I am just concerned that if the kernel does not
enforce such a hierarchy automatically it won't really happen.

> The only thing that C does is BPF_PIN_FD of fd that was received from
> bpf syscall. That's way cleaner api than doing bind mount from C
> program.

I am with you there. Unfortunately we don't have a give "this fd a name"
syscalls so far so I totally understand the decision here.

> We've considered letting open() of the file return bpf specific
> anon-inode, but decided to reserve that for other more natural file
> operations. Therefore BPF_NEW_FD is needed.

Can't this be overloaded somehow. You can use mknod for creation and
open for regular file use. mknod is its own syscall.

> > I don't know in terms of how many objects bpf should be able to handle
> > and if such a bind-mount based solution would work, I guess not.
> 
> We definitely missed you at the last plumbers where it was discussed :)

Yes. :(

> > In my opinion I still favor a user space approach.
> 
> that's not acceptable for tracing use cases. No daemons allowed.

Oh, tracing does not allow daemons. Why? I can only imagine embedded
users, no?

> > Subsystems which use
> > ebpf in a way that no user space program needs to be running to control
> > them would need to export the fds by itself. E.g. something like
> > sysfs/kobject for tc? The hierarchy would then be in control of the
> > subsystem which could also create a proper naming hierarchy or maybe
> > even use an already given one. Do most other eBPF users really need to
> > persist file descriptors somewhere without user space control and pick
> > them up later?
> 
> I think it's way cleaner to have one way of solving it (like this patch
> does) instead of asking every subsystem to solve it differently.
> We've also looked at sysfs and it's ugly when it comes to removing,
> since the user cannot use normal 'rm'.

Ah, okay. Probably it would depend on some tc node always referencing
the bpf entity. But I see that sysfs might become too problematic.

Bye,
Hannes
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] netfilter: turn NF_HOOK into an inline function

2015-10-16 Thread Pablo Neira Ayuso
On Fri, Oct 09, 2015 at 08:45:42PM +0200, Arnd Bergmann wrote:
> A recent change to the dst_output handling caused a new warning
> when the call to NF_HOOK() is the only used of a local variable
> passed as 'dev', and CONFIG_NETFILTER is disabled:
> 
> net/ipv6/ip6_output.c: In function 'ip6_output':
> net/ipv6/ip6_output.c:135:21: warning: unused variable 'dev' 
> [-Wunused-variable]
> 
> The reason for this is that the NF_HOOK macro in this case does
> not reference the variable at all, and the call to dev_net(dev)
> got removed from the ip6_output function. To avoid that warning now
> and in the future, this changes the macro into an equivalent
> inline function, which tells the compiler that the variable is
> passed correctly but still unused.
> 
> The dn_forward function apparently had the same problem in
> the past and added a local workaround that no longer works
> with the inline function. In order to avoid a regression, we
> have to also remove the #ifdef from decnet in the same patch.

Applied, thanks.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs

2015-10-16 Thread Hannes Frederic Sowa
On Fri, Oct 16, 2015, at 15:36, Daniel Borkmann wrote:
> On 10/16/2015 12:25 PM, Hannes Frederic Sowa wrote:
> > On Fri, Oct 16, 2015, at 03:09, Daniel Borkmann wrote:
> >> This eventually leads us to this patch, which implements a minimal
> >> eBPF file system. The idea is a bit similar, but to the point that
> >> these inodes reside at one or multiple mount points. A directory
> >> hierarchy can be tailored to a specific application use-case from the
> >> various subsystem users and maps/progs pinned inside it. Two new eBPF
> >> commands (BPF_PIN_FD, BPF_NEW_FD) have been added to the syscall in
> >> order to create one or multiple special inodes from an existing file
> >> descriptor that points to a map/program (we call it eBPF fd pinning),
> >> or to create a new file descriptor from an existing special inode.
> >> BPF_PIN_FD requires CAP_SYS_ADMIN capabilities, whereas BPF_NEW_FD
> >> can also be done unpriviledged when having appropriate permissions
> >> to the path.
> >
> > In my opinion this is very un-unixiy, I have to say at least.
> >
> > Namespaces at some point dealt with the same problem, they nowadays use
> > bind mounts of /proc/$$/ns/* to some place in the file hierarchy to keep
> > the namespace alive. This at least allows someone to build up its own
> > hierarchy with normal unix tools and not hidden inside a C-program. For
> > filedescriptors we already have /proc/$$/fd/* but it seems that doesn't
> > work out of the box nowadays.
> 
> Yes, that doesn't work out of the box, but I also don't know how usable
> that would really be. The idea is roughly rather similar to the paths
> passed to bind(2)/connect(2) on Unix domain sockets, as mentioned. You
> have a map/prog resource that you stick to a special inode so that you
> can retrieve it at a later point in time from the same or different
> processes through a new fd pointing to the resource from user side, so
> that the bpf(2) syscall can be performed upon it.
> 
> With Unix tools, you could still create/remove a hierarchy or unlink
> those that have maps/progs. You are correct that tools that don't
> implement bpf(2) currently cannot access the content behind it, since
> bpf(2) manages access to the data itself. I did like the 2nd idea though,
> mentioned in the commit log, but don't know how flexible we are in
> terms of adding S_IFBPF to the UAPI.

I don't think it should be a problem. You referred to POSIX Standard in
your other mail but I can't see any reason why not to establish a new
file mode. Anyway, FreeBSD (e.g. whiteouts) and Solaris (e.g. Doors,
Event Ports) are just examples of new modes being added.

mknod /bpf/map/1 m 1 1

:)

Yes, maybe I think this is a better solution architectural instead of
constructing a new filesystem.

> > I don't know in terms of how many objects bpf should be able to handle
> > and if such a bind-mount based solution would work, I guess not.
> >
> > In my opinion I still favor a user space approach. Subsystems which use
> > ebpf in a way that no user space program needs to be running to control
> > them would need to export the fds by itself. E.g. something like
> > sysfs/kobject for tc? The hierarchy would then be in control of the
> > subsystem which could also create a proper naming hierarchy or maybe
> > even use an already given one. Do most other eBPF users really need to
> > persist file descriptors somewhere without user space control and pick
> > them up later?
> 
> I was thinking about a strict predefined hierarchy dictated by the kernel
> as well, but was then considering a more flexible approach that could be
> tailored freely to various use cases. A predefined hierarchy would most
> likely need to be resolved per subsystem and it's not really easy to map
> this properly. F.e. if the kernel would try to provide unique ids (as
> opposed to have a name or annotation member through the syscall), it
> could end up being quite cryptic. If we let the users choose names, I'm
> not sure if a single hierarchy level would be enough. Then, additionally
> you have facilities like tail calls that eBPF programs could do.

I don't think that most subsystems need to expose those file
descriptors. Seccomp probably will have a supervisor process running and
per aggregation will also have a user space process running keeping the
fd alive. So it is all about tc/sched.

And I am not sure if tc will really needs a filesystem to handle all
this. The simplest approach is to just keep a name <-> fd mapping
somewhere in the net/sched/ subsystem and use this for all tc users.
Otherwise can we somehow Incorporate this in sysfs directory where we
maybe create a kobject per installed filter, something along those
lines.

I see that tail calls makes this all very difficult to show which entity
uses which ebpf entity in some way, as it looks like n:m relationships.

> In such cases, one could even craft relationships where a (strict auto
> generated) tree representation would not be sufficient (f.e.
> recirculati

Re: [PATCH] ipv6: no addrconf for slave devices

2015-10-16 Thread Jan Blunck
On Fri, Oct 16, 2015 at 6:14 PM, David Ahern  wrote:
> On 10/16/15 10:12 AM, Jan Blunck wrote:
>>
>> On Fri, Oct 16, 2015 at 6:02 PM, David Ahern 
>> wrote:
>>>
>>> On 10/16/15 9:57 AM, Jan Blunck wrote:



 I don't think that enslaved ports should get network layer addresses.
 This is one example with a team device:
>>>
>>>
>>>
>>> for VRF devices we do want the enslaved links to have link local
>>> addresses.
>>>
>>
>> That is interesting. As far I can see you are setting IFF_SLAVE in
>> do_vrf_add_slave() and therefore already stop IPv6 addrconf.
>>
>
> Check net-next. That had to be removed to get IPv6 working.
>

Thanks for the pointer.

So it would be better to differentiate between L2 and L3 ports and
only start addrconf on later ones? I don't think there is a flag that
allows for that though.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] net: phy: smsc: disable energy detect mode

2015-10-16 Thread Florian Fainelli
2015-10-13 21:17 GMT-07:00 Heiko Schocher :
> Hello Florian,
>
>
> Am 13.10.2015 um 21:26 schrieb Florian Fainelli:
>>
>> On 12/10/15 22:13, Heiko Schocher wrote:
>>>
>>> On some boards the energy enable detect mode leads in
>>> trouble with some switches, so make the enabling of
>>> this mode configurable through DT.
>>>
>>> Signed-off-by: Heiko Schocher 
>>> ---
>>>
>>>   .../devicetree/bindings/net/smsc-lan87xx.txt   | 19
>>> +
>>>   drivers/net/phy/smsc.c | 24
>>> +-
>>>   2 files changed, 38 insertions(+), 5 deletions(-)
>>>   create mode 100644
>>> Documentation/devicetree/bindings/net/smsc-lan87xx.txt
>>>
>>> diff --git a/Documentation/devicetree/bindings/net/smsc-lan87xx.txt
>>> b/Documentation/devicetree/bindings/net/smsc-lan87xx.txt
>>> new file mode 100644
>>> index 000..39aa1dc
>>> --- /dev/null
>>> +++ b/Documentation/devicetree/bindings/net/smsc-lan87xx.txt
>>> @@ -0,0 +1,19 @@
>>> +SMSC LAN87xx Ethernet PHY
>>> +
>>> +Some boards require special tuning values. Configure them
>>> +through an Ethernet OF device node.
>>> +
>>> +Optional properties:
>>> +
>>> +- disable-energy-detect:
>>> +  If set, do not enable energy detect mode for the SMSC phy.
>>> +  default: enable energy detect mode
>>
>>
>> Although energy detection is something that is implemented by many PHYs,
>> I am not sure a generic property is suitable here, I would prefix that
>> with the SMSC vendor prefix here to make it clear this only applies to
>> this PHY.
>
>
> Hmm... but all PHYs should be able to enable, disable it in some way, or?

It may not always be controlled directly at the PHY level, sometimes
this is something that needs cooperation with the Ethernet MAC as well
in case of integrated designs.

>
>> Would not you want to make it a reverse property here though, something
>> like this:
>>
>> smsc,energy-detect: boolean, when present indicates the PHY reliably
>> supports energy detection
>
>
> Yes, that was also my first thought, but currently, on this PHYs
> energy detect mode is on ... and if I introduce such a property,
> it will disable it for all existing boards, because property is
> missing ... so, maybe I break boards ...

Fair enough, how about smsc,disabled-energy-detect or something like that then?
-- 
Florian
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch net-next v5 3/8] switchdev: allow caller to explicitly request attr_set as deferred

2015-10-16 Thread Scott Feldman
On Fri, Oct 16, 2015 at 1:23 AM, Jiri Pirko  wrote:
> Thu, Oct 15, 2015 at 05:21:22PM CEST, john.fastab...@gmail.com wrote:
>>On 15-10-14 10:40 AM, Jiri Pirko wrote:
>>> From: Jiri Pirko 
>>>
>>> Caller should know if he can call attr_set directly (when holding RTNL)
>>> or if he has to defer the att_set processing for later.
>>>
>>> This also allows drivers to sleep inside attr_set and report operation
>>> status back to switchdev core. Switchdev core then warns if status is
>>> not ok, instead of silent errors happening in drivers.
>>>
>>> Benefit from newly introduced switchdev deferred ops infrastructure.
>>>
>>
>>A nit but the patch description should note your setting the defer bit
>>on the bridge set state.
>>
>>> Signed-off-by: Jiri Pirko 
>>> ---
>>>  include/net/switchdev.h   |   1 +
>>>  net/bridge/br_stp.c   |   3 +-
>>>  net/switchdev/switchdev.c | 108 
>>> ++
>>>  3 files changed, 46 insertions(+), 66 deletions(-)
>>>
>>> diff --git a/include/net/switchdev.h b/include/net/switchdev.h
>>> index d1c7f90..f7de6f8 100644
>>> --- a/include/net/switchdev.h
>>> +++ b/include/net/switchdev.h
>>> @@ -17,6 +17,7 @@
>>>
>>>  #define SWITCHDEV_F_NO_RECURSE  BIT(0)
>>>  #define SWITCHDEV_F_SKIP_EOPNOTSUPP BIT(1)
>>> +#define SWITCHDEV_F_DEFER   BIT(2)
>>>
>>>  struct switchdev_trans_item {
>>>  struct list_head list;
>>> diff --git a/net/bridge/br_stp.c b/net/bridge/br_stp.c
>>> index db6d243de..80c34d7 100644
>>> --- a/net/bridge/br_stp.c
>>> +++ b/net/bridge/br_stp.c
>>> @@ -41,13 +41,14 @@ void br_set_state(struct net_bridge_port *p, unsigned 
>>> int state)
>>>  {
>>>  struct switchdev_attr attr = {
>>>  .id = SWITCHDEV_ATTR_ID_PORT_STP_STATE,
>>> +.flags = SWITCHDEV_F_DEFER,
>>>  .u.stp_state = state,
>>>  };
>>
>>
>>This creates a possible race (with 6/8) I think, please check!
>
> Wait. This patch does not change the previous behaviour. Patch 6 does,
> so I don't understand why you are asking here. Confusing.
>
>
>>
>>In del_nbp() we call br_stp_disable_port() to set the port state
>>to BR_STATE_DISABLE and disabling learning events. But with this
>>patch it can be deferred. Also note the STP agent may be in userspace
>>which actually seems more likely the case because you likely want to
>>run some more modern variant of STP than the kernel supports.
>>
>>So at some point in the future the driver will turn off learning. At
>>the same time we call br_fdb_delete_by_port which calls a deferred
>>set of fdb deletes.
>>
>>I don't see how you guarantee learning is off before you start doing
>>the deletes here and possibly learning new addresses after the software
>>side believes the port is down.
>>
>>So
>>
>>   br_stp_disable_port
>>   br_fdb_delete_by_port
>>   {fdb_del_external_learn}
>>   [hw learns a fdb]
>>   [hw disables learning]
>>
>>What stops this from happening?
>
> Okay. This behaviour is the same as without the patchset. What would
> resolve the issue it to put switchdev_deferred_process() after
> br_stp_disable_port() and before br_fdb_delete_by_port() call.
> That would enforce stp change to happen in hw before fdbs are explicitly
> deleted. Sound good to you?

Doesn't HW already see things in the right order since items are
dequeued from the deferred list in the order queued?
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs

2015-10-16 Thread Alexei Starovoitov

On 10/16/15 3:25 AM, Hannes Frederic Sowa wrote:

Namespaces at some point dealt with the same problem, they nowadays use
bind mounts of/proc/$$/ns/* to some place in the file hierarchy to keep
the namespace alive. This at least allows someone to build up its own
hierarchy with normal unix tools and not hidden inside a C-program. For
filedescriptors we already have/proc/$$/fd/* but it seems that doesn't
work out of the box nowadays.


bind mounting of /proc/../fd was initially proposed by Andy and we've
looked at it thoroughly, but after discussion with Eric it became
apparent that it doesn't fit here. At the end we need shell tools
to access maps.
Also I think you missed the hierarchy in this patch set _is_ built with
normal 'mkdir' and files are removed with 'rm'.
The only thing that C does is BPF_PIN_FD of fd that was received from
bpf syscall. That's way cleaner api than doing bind mount from C
program.
We've considered letting open() of the file return bpf specific
anon-inode, but decided to reserve that for other more natural file
operations. Therefore BPF_NEW_FD is needed.


I don't know in terms of how many objects bpf should be able to handle
and if such a bind-mount based solution would work, I guess not.


We definitely missed you at the last plumbers where it was discussed :)


In my opinion I still favor a user space approach.


that's not acceptable for tracing use cases. No daemons allowed.


Subsystems which use
ebpf in a way that no user space program needs to be running to control
them would need to export the fds by itself. E.g. something like
sysfs/kobject for tc? The hierarchy would then be in control of the
subsystem which could also create a proper naming hierarchy or maybe
even use an already given one. Do most other eBPF users really need to
persist file descriptors somewhere without user space control and pick
them up later?


I think it's way cleaner to have one way of solving it (like this patch
does) instead of asking every subsystem to solve it differently.
We've also looked at sysfs and it's ugly when it comes to removing,
since the user cannot use normal 'rm'.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


  1   2   3   >