[net-next 05/17] i40evf: don't give up
From: Mitch Williams When the VF driver is unable to communicate with the PF, it just gives up and never tries again. Aside from the obvious character flaw that this shows, it's also a lousy user experience. When PF communications fail, wait five seconds, and try again. And again. Don't give up, little VF driver! Your prince will come! Change-ID: Ia1378a39879883563b8faffce819f375821f9585 Signed-off-by: Mitch Williams Tested-by: Andrew Bowers Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/i40evf/i40evf_main.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/intel/i40evf/i40evf_main.c b/drivers/net/ethernet/intel/i40evf/i40evf_main.c index 1dd5245..ce99790 100644 --- a/drivers/net/ethernet/intel/i40evf/i40evf_main.c +++ b/drivers/net/ethernet/intel/i40evf/i40evf_main.c @@ -2357,9 +2357,12 @@ err_alloc: err: /* Things went into the weeds, so try again later */ if (++adapter->aq_wait_count > I40EVF_AQ_MAX_ERR) { - dev_err(&pdev->dev, "Failed to communicate with PF; giving up\n"); + dev_err(&pdev->dev, "Failed to communicate with PF; waiting before retry\n"); adapter->flags |= I40EVF_FLAG_PF_COMMS_FAILED; - return; /* do not reschedule */ + i40evf_shutdown_adminq(hw); + adapter->state = __I40EVF_STARTUP; + schedule_delayed_work(&adapter->init_task, HZ * 5); + return; } schedule_delayed_work(&adapter->init_task, HZ); } -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[net-next 06/17] i40e/i40evf: refactor IRQ enable function
From: Jesse Brandeburg This change moves a multi-line register setting into a function which simplifies reading the flow of the enable function. This also fixes a bug where the enable function was enabling the interrupt twice while trying to update the two interrupt throttle rate thresholds for Rx and Tx. Change-ID: Ie308f9d0d48540204590cb9d7a5a7b1196f959bb Signed-off-by: Jesse Brandeburg Tested-by: Andrew Bowers Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/i40e/i40e_txrx.c | 112 +++--- drivers/net/ethernet/intel/i40evf/i40e_txrx.c | 108 ++--- 2 files changed, 128 insertions(+), 92 deletions(-) diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c index 512707c..889d25d 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c @@ -815,6 +815,8 @@ void i40e_force_wb(struct i40e_vsi *vsi, struct i40e_q_vector *q_vector) * i40e_set_new_dynamic_itr - Find new ITR level * @rc: structure containing ring performance data * + * Returns true if ITR changed, false if not + * * Stores a new ITR value based on packets and byte counts during * the last interrupt. The advantage of per interrupt computation * is faster updates and more accurate ITR for the current traffic @@ -823,14 +825,14 @@ void i40e_force_wb(struct i40e_vsi *vsi, struct i40e_q_vector *q_vector) * testing data as well as attempting to minimize response time * while increasing bulk throughput. **/ -static void i40e_set_new_dynamic_itr(struct i40e_ring_container *rc) +static bool i40e_set_new_dynamic_itr(struct i40e_ring_container *rc) { enum i40e_latency_range new_latency_range = rc->latency_range; u32 new_itr = rc->itr; int bytes_per_int; if (rc->total_packets == 0 || !rc->itr) - return; + return false; /* simple throttlerate management * 0-10MB/s lowest (10 ints/s) @@ -874,11 +876,15 @@ static void i40e_set_new_dynamic_itr(struct i40e_ring_container *rc) break; } - if (new_itr != rc->itr) - rc->itr = new_itr; - rc->total_bytes = 0; rc->total_packets = 0; + + if (new_itr != rc->itr) { + rc->itr = new_itr; + return true; + } + + return false; } /** @@ -1747,6 +1753,21 @@ static int i40e_clean_rx_irq_1buf(struct i40e_ring *rx_ring, int budget) return total_rx_packets; } +static u32 i40e_buildreg_itr(const int type, const u16 itr) +{ + u32 val; + + val = I40E_PFINT_DYN_CTLN_INTENA_MASK | + I40E_PFINT_DYN_CTLN_CLEARPBA_MASK | + (type << I40E_PFINT_DYN_CTLN_ITR_INDX_SHIFT) | + (itr << I40E_PFINT_DYN_CTLN_INTERVAL_SHIFT); + + return val; +} + +/* a small macro to shorten up some long lines */ +#define INTREG I40E_PFINT_DYN_CTLN + /** * i40e_update_enable_itr - Update itr and re-enable MSIX interrupt * @vsi: the VSI we care about @@ -1757,54 +1778,53 @@ static inline void i40e_update_enable_itr(struct i40e_vsi *vsi, struct i40e_q_vector *q_vector) { struct i40e_hw *hw = &vsi->back->hw; - u16 old_itr; + bool rx = false, tx = false; + u32 rxval, txval; int vector; - u32 val; vector = (q_vector->v_idx + vsi->base_vector); + + rxval = txval = i40e_buildreg_itr(I40E_ITR_NONE, 0); + if (ITR_IS_DYNAMIC(vsi->rx_itr_setting)) { - old_itr = q_vector->rx.itr; - i40e_set_new_dynamic_itr(&q_vector->rx); - if (old_itr != q_vector->rx.itr) { - val = I40E_PFINT_DYN_CTLN_INTENA_MASK | - I40E_PFINT_DYN_CTLN_CLEARPBA_MASK | - (I40E_RX_ITR << - I40E_PFINT_DYN_CTLN_ITR_INDX_SHIFT) | - (q_vector->rx.itr << - I40E_PFINT_DYN_CTLN_INTERVAL_SHIFT); - } else { - val = I40E_PFINT_DYN_CTLN_INTENA_MASK | - I40E_PFINT_DYN_CTLN_CLEARPBA_MASK | - (I40E_ITR_NONE << - I40E_PFINT_DYN_CTLN_ITR_INDX_SHIFT); - } - if (!test_bit(__I40E_DOWN, &vsi->state)) - wr32(hw, I40E_PFINT_DYN_CTLN(vector - 1), val); - } else { - i40e_irq_dynamic_enable(vsi, q_vector->v_idx); + rx = i40e_set_new_dynamic_itr(&q_vector->rx); + rxval = i40e_buildreg_itr(I40E_RX_ITR, q_vector->rx.itr); } + if (ITR_IS_DYNAMIC(vsi->tx_itr_setting)) { - old_itr = q_vector->tx.itr; - i40e_set_new_dynamic_itr(&q_vector->tx); - if (old_itr != q_vector->tx.itr) { - val = I40E_PFINT_DYN_CTLN_INTENA_MASK | -
[net-next 07/17] i40e/i40evf: fix bug in throttle rate math
From: Jesse Brandeburg The driver was using a value expressed in 2us increments for the divisor to figure out our bytes/usec values. Fix the usecs variable to contain a value in microseconds. Change-ID: I5c20493103c295d6f201947bb908add7040b7c41 Signed-off-by: Jesse Brandeburg Tested-by: Andrew Bowers Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/i40e/i40e_txrx.c | 9 - drivers/net/ethernet/intel/i40evf/i40e_txrx.c | 9 - 2 files changed, 16 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c index 889d25d..c96e581 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c @@ -830,6 +830,7 @@ static bool i40e_set_new_dynamic_itr(struct i40e_ring_container *rc) enum i40e_latency_range new_latency_range = rc->latency_range; u32 new_itr = rc->itr; int bytes_per_int; + int usecs; if (rc->total_packets == 0 || !rc->itr) return false; @@ -838,8 +839,14 @@ static bool i40e_set_new_dynamic_itr(struct i40e_ring_container *rc) * 0-10MB/s lowest (10 ints/s) * 10-20MB/s low(2 ints/s) * 20-1249MB/s bulk (8000 ints/s) +* +* The math works out because the divisor is in 10^(-6) which +* turns the bytes/us input value into MB/s values, but +* make sure to use usecs, as the register values written +* are in 2 usec increments in the ITR registers. */ - bytes_per_int = rc->total_bytes / rc->itr; + usecs = (rc->itr << 1); + bytes_per_int = rc->total_bytes / usecs; switch (new_latency_range) { case I40E_LOWEST_LATENCY: if (bytes_per_int > 10) diff --git a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c index 597e096..7d98042 100644 --- a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c +++ b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c @@ -333,6 +333,7 @@ static bool i40e_set_new_dynamic_itr(struct i40e_ring_container *rc) enum i40e_latency_range new_latency_range = rc->latency_range; u32 new_itr = rc->itr; int bytes_per_int; + int usecs; if (rc->total_packets == 0 || !rc->itr) return false; @@ -341,8 +342,14 @@ static bool i40e_set_new_dynamic_itr(struct i40e_ring_container *rc) * 0-10MB/s lowest (10 ints/s) * 10-20MB/s low(2 ints/s) * 20-1249MB/s bulk (8000 ints/s) +* +* The math works out because the divisor is in 10^(-6) which +* turns the bytes/us input value into MB/s values, but +* make sure to use usecs, as the register values written +* are in 2 usec increments in the ITR registers. */ - bytes_per_int = rc->total_bytes / rc->itr; + usecs = (rc->itr << 1); + bytes_per_int = rc->total_bytes / usecs; switch (new_latency_range) { case I40E_LOWEST_LATENCY: if (bytes_per_int > 10) -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[net-next 14/17] i40e: Move error message to debug level
From: Catherine Sullivan There is an error coming back from get_phy_capabilities that does not seem to have any functional implications. We will continue looking into why this error message is occurring, but in the meantime, we will move it to debug to avoid confusion. Change-ID: I9091754bf62c066ddedeb249923d85606e2d68ed Signed-off-by: Catherine Sullivan Tested-by: Andrew Bowers Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/i40e/i40e_ethtool.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c index 6e5f931..8fd26fd 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c +++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c @@ -751,9 +751,9 @@ static int i40e_set_settings(struct net_device *netdev, status = i40e_update_link_info(hw); if (status) - netdev_info(netdev, "Updating link info failed with err %s aq_err %s\n", - i40e_stat_str(hw, status), - i40e_aq_str(hw, hw->aq.asq_last_status)); + netdev_dbg(netdev, "Updating link info failed with err %s aq_err %s\n", + i40e_stat_str(hw, status), + i40e_aq_str(hw, hw->aq.asq_last_status)); } else { netdev_info(netdev, "Nothing changed, exiting without setting anything.\n"); -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[net-next 00/17][pull request] Intel Wired LAN Driver Updates 2015-10-16
This series contains updates to i40e and i40evf only. Kiran adds a spinlock around code accessing VSI MAC filter list to ensure that we are synchronizing access to the filter list, otherwise we can end up with multiple accesses at the same time which can cause the VSI MAC filter list to get in an unstable or corrupted state. Jesse fixes overlong BIT defines, where the RSS enabling call were mistakenly missed. Also fixes a bug where the enable function was enabling the interrupt twice while trying to update the two interrupt throttle rate thresholds for Rx and Tx, while refactoring the IRQ enable function to simplify reading the flow. Addressed the high CPU utilization of some small streaming workloads that the driver should reduce CPU in. Anjali fixes two X722 issues with respect to EEPROM checksum verify and reading NVM version info. Fixed where a mask value was accidentally replaced with a bit mask causing Flow Director sideband to be broken. Alex Duyck fixes areas of the drivers which run from hard interrupt context or with interrupts already disabled in netpoll, so use napi_schedule_irqoff() instead of napi_schedule(). Mitch fixes the VF drivers to not easily give up when it is not able to communicate with the PF driver. Carolyn fixes a problem where our tools MAC loopback test, after driver unbind would fail because the hardware was configured for multiqueue and unbind operation did not clear this configuration. Also fixed a issue where the NVMUpdate tool gets bad data from the PHY when using the PHY NVM feature because of contention on the MDIO interface from getting PHY capability calls from the driver during regular operations. Catherine fixed an issue where we were checking if autoneg was allowed to change before checking if autoneg was changing, these checks need to be in the reverse order. Jean Sacren fixes up an function header comment to align the kernel-docs with the actual code. The following are changes since commit 4be3158abe1e02d24f82b34101e41d662fae2185: Merge branch 'mlxsw-spectrum' and are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue master Alexander Duyck (1): i40e/i40evf: use napi_schedule_irqoff() Anjali Singhai (1): i40e: Fix basic support for X722 devices Anjali Singhai Jain (1): i40e/i40evf: Fix an accidental error with BIT_ULL replacement Carolyn Wyborny (2): i40e: Fix for Tools loopback test failing after driver load i40e: fix for PHY NVM interaction problem Catherine Sullivan (3): i40e: Fix order of checks when enabling/disabling autoneg in ethtool i40e: Move error message to debug level i40e/i40evf: Bump i40e to 1.3.38 and i40evf to 1.3.25 Jean Sacren (2): i40e: fix kernel-doc argument name i40e: declare rather than initialize int object Jesse Brandeburg (5): i40evf: fix overlong BIT defines i40e/i40evf: refactor IRQ enable function i40e/i40evf: fix bug in throttle rate math i40e/i40evf: change dynamic interrupt thresholds i40e/i40evf: adjust interrupt throttle less frequently Kiran Patil (1): i40e: Lock for VSI's MAC filter list Mitch Williams (1): i40evf: don't give up drivers/net/ethernet/intel/i40e/i40e.h | 4 + drivers/net/ethernet/intel/i40e/i40e_common.c | 17 +- drivers/net/ethernet/intel/i40e/i40e_ethtool.c | 35 ++- drivers/net/ethernet/intel/i40e/i40e_fcoe.c| 2 + drivers/net/ethernet/intel/i40e/i40e_main.c| 347 + drivers/net/ethernet/intel/i40e/i40e_nvm.c | 31 +- drivers/net/ethernet/intel/i40e/i40e_txrx.c| 166 ++ drivers/net/ethernet/intel/i40e/i40e_txrx.h| 7 +- drivers/net/ethernet/intel/i40e/i40e_type.h| 7 +- drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c | 26 +- drivers/net/ethernet/intel/i40evf/i40e_common.c| 5 - drivers/net/ethernet/intel/i40evf/i40e_txrx.c | 163 ++ drivers/net/ethernet/intel/i40evf/i40e_txrx.h | 21 +- drivers/net/ethernet/intel/i40evf/i40e_type.h | 4 +- drivers/net/ethernet/intel/i40evf/i40evf.h | 2 + drivers/net/ethernet/intel/i40evf/i40evf_main.c| 13 +- 16 files changed, 631 insertions(+), 219 deletions(-) -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[net-next 15/17] i40e: fix kernel-doc argument name
From: Jean Sacren The second argument name in the kernel-doc argument list for i40e_features_check() was slightly off. Fix it for the kernel doc. Signed-off-by: Jean Sacren Tested-by: Andrew Bowers Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/i40e/i40e_main.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c index 9b4fa3e..54af6d6 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_main.c +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c @@ -8562,7 +8562,7 @@ static int i40e_ndo_bridge_getlink(struct sk_buff *skb, u32 pid, u32 seq, /** * i40e_features_check - Validate encapsulated packet conforms to limits * @skb: skb buff - * @netdev: This physical port's netdev + * @dev: This physical port's netdev * @features: Offload features that the stack believes apply **/ static netdev_features_t i40e_features_check(struct sk_buff *skb, -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[net-next 16/17] i40e: declare rather than initialize int object
From: Jean Sacren 'err' would be overwritten immediately, so we should declare it only rather than initialize it to zero. Signed-off-by: Jean Sacren Tested-by: Andrew Bowers Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/i40e/i40e_main.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c index 54af6d6..92b9d27 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_main.c +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c @@ -10179,7 +10179,7 @@ static int i40e_probe(struct pci_dev *pdev, const struct pci_device_id *ent) static u16 pfs_found; u16 wol_nvm_bits; u16 link_status; - int err = 0; + int err; u32 len; u32 i; -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[net-next 03/17] i40e: Fix basic support for X722 devices
From: Anjali Singhai Acquire NVM, before issuing an AQ read nvm command for X722. We need to acquire the NVM before issuing an AQ read to the NVM otherwise we will get EBUSY from the FW. Also release when done. This fixes the two X722 issues with respect to eeprom checksum verify and reading NVM version info. With this patch in place, i40e driver will provide basic support for X722 devices. Signed-off-by: Anjali Singhai Jain Acked-by: Jesse Brandeburg Tested-by: Andrew Bowers Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/i40e/i40e_common.c | 3 +++ drivers/net/ethernet/intel/i40e/i40e_nvm.c| 31 +-- drivers/net/ethernet/intel/i40e/i40e_type.h | 3 +++ 3 files changed, 31 insertions(+), 6 deletions(-) diff --git a/drivers/net/ethernet/intel/i40e/i40e_common.c b/drivers/net/ethernet/intel/i40e/i40e_common.c index daa6b17..7f456e9 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_common.c +++ b/drivers/net/ethernet/intel/i40e/i40e_common.c @@ -958,6 +958,9 @@ i40e_status i40e_init_shared_code(struct i40e_hw *hw) else hw->pf_id = (u8)(func_rid & 0x7); + if (hw->mac.type == I40E_MAC_X722) + hw->flags |= I40E_HW_FLAG_AQ_SRCTL_ACCESS_ENABLE; + status = i40e_init_nvm(hw); return status; } diff --git a/drivers/net/ethernet/intel/i40e/i40e_nvm.c b/drivers/net/ethernet/intel/i40e/i40e_nvm.c index 875a8cc..58e384a 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_nvm.c +++ b/drivers/net/ethernet/intel/i40e/i40e_nvm.c @@ -290,9 +290,18 @@ static i40e_status i40e_read_nvm_word_aq(struct i40e_hw *hw, u16 offset, i40e_status i40e_read_nvm_word(struct i40e_hw *hw, u16 offset, u16 *data) { - if (hw->mac.type == I40E_MAC_X722) - return i40e_read_nvm_word_aq(hw, offset, data); - return i40e_read_nvm_word_srctl(hw, offset, data); + enum i40e_status_code ret_code = 0; + + if (hw->flags & I40E_HW_FLAG_AQ_SRCTL_ACCESS_ENABLE) { + ret_code = i40e_acquire_nvm(hw, I40E_RESOURCE_READ); + if (!ret_code) { + ret_code = i40e_read_nvm_word_aq(hw, offset, data); + i40e_release_nvm(hw); + } + } else { + ret_code = i40e_read_nvm_word_srctl(hw, offset, data); + } + return ret_code; } /** @@ -397,9 +406,19 @@ read_nvm_buffer_aq_exit: i40e_status i40e_read_nvm_buffer(struct i40e_hw *hw, u16 offset, u16 *words, u16 *data) { - if (hw->mac.type == I40E_MAC_X722) - return i40e_read_nvm_buffer_aq(hw, offset, words, data); - return i40e_read_nvm_buffer_srctl(hw, offset, words, data); + enum i40e_status_code ret_code = 0; + + if (hw->flags & I40E_HW_FLAG_AQ_SRCTL_ACCESS_ENABLE) { + ret_code = i40e_acquire_nvm(hw, I40E_RESOURCE_READ); + if (!ret_code) { + ret_code = i40e_read_nvm_buffer_aq(hw, offset, words, + data); + i40e_release_nvm(hw); + } + } else { + ret_code = i40e_read_nvm_buffer_srctl(hw, offset, words, data); + } + return ret_code; } /** diff --git a/drivers/net/ethernet/intel/i40e/i40e_type.h b/drivers/net/ethernet/intel/i40e/i40e_type.h index 9c4a457..f631ca0 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_type.h +++ b/drivers/net/ethernet/intel/i40e/i40e_type.h @@ -544,6 +544,9 @@ struct i40e_hw { struct i40e_dcbx_config remote_dcbx_config; /* Peer Cfg */ struct i40e_dcbx_config desired_dcbx_config; /* CEE Desired Cfg */ +#define I40E_HW_FLAG_AQ_SRCTL_ACCESS_ENABLE BIT_ULL(0) + u64 flags; + /* debug mask */ u32 debug_mask; char err_str[16]; -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[net-next 12/17] i40e/i40evf: Fix an accidental error with BIT_ULL replacement
From: Anjali Singhai Jain A mask value of 0x1FF was accidentally replaced with a bit mask causing flow director sideband to be broken. Change-ID: Id3387f67dd1b567b41692b570b383c58671e1eae Signed-off-by: Anjali Singhai Jain Tested-by: Andrew Bowers Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/i40e/i40e_type.h | 4 ++-- drivers/net/ethernet/intel/i40evf/i40e_type.h | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/net/ethernet/intel/i40e/i40e_type.h b/drivers/net/ethernet/intel/i40e/i40e_type.h index f631ca0..dd2da35 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_type.h +++ b/drivers/net/ethernet/intel/i40e/i40e_type.h @@ -1068,8 +1068,8 @@ enum i40e_filter_program_desc_fd_status { }; #define I40E_TXD_FLTR_QW0_DEST_VSI_SHIFT 23 -#define I40E_TXD_FLTR_QW0_DEST_VSI_MASK \ - BIT_ULL(I40E_TXD_FLTR_QW0_DEST_VSI_SHIFT) +#define I40E_TXD_FLTR_QW0_DEST_VSI_MASK(0x1FFUL << \ +I40E_TXD_FLTR_QW0_DEST_VSI_SHIFT) #define I40E_TXD_FLTR_QW1_CMD_SHIFT4 #define I40E_TXD_FLTR_QW1_CMD_MASK (0xULL << \ diff --git a/drivers/net/ethernet/intel/i40evf/i40e_type.h b/drivers/net/ethernet/intel/i40evf/i40e_type.h index 85af3b4..301fe2b 100644 --- a/drivers/net/ethernet/intel/i40evf/i40e_type.h +++ b/drivers/net/ethernet/intel/i40evf/i40e_type.h @@ -1055,8 +1055,8 @@ enum i40e_filter_program_desc_fd_status { }; #define I40E_TXD_FLTR_QW0_DEST_VSI_SHIFT 23 -#define I40E_TXD_FLTR_QW0_DEST_VSI_MASK \ - BIT_ULL(I40E_TXD_FLTR_QW0_DEST_VSI_SHIFT) +#define I40E_TXD_FLTR_QW0_DEST_VSI_MASK(0x1FFUL << \ +I40E_TXD_FLTR_QW0_DEST_VSI_SHIFT) #define I40E_TXD_FLTR_QW1_CMD_SHIFT4 #define I40E_TXD_FLTR_QW1_CMD_MASK (0xULL << \ -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[net-next 01/17] i40e: Lock for VSI's MAC filter list
From: Kiran Patil This patch introduces a spinlock which is to be used for synchronizing access to VSI's MAC filter list. This patch also synchronizes execution of other codepaths which are accessing VSI's MAC filter list with execution of service_task:sync_vsi_filters. In function i40e_add_vsi, copied out LAA MAC address instead of cloning MAC filter entry because only MAC address is needed to remove MAC VLAN filter from FW/HW. Change-ID: I0e10ac7c715d44aa994239642aa4d57c998573a2 Signed-off-by: Kiran Patil Signed-off-by: Jesse Brandeburg Tested-by: Andrew Bowers Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/i40e/i40e.h | 2 + drivers/net/ethernet/intel/i40e/i40e_fcoe.c| 2 + drivers/net/ethernet/intel/i40e/i40e_main.c| 328 + drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c | 26 +- 4 files changed, 302 insertions(+), 56 deletions(-) diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h index 7a58b1f..b7818be 100644 --- a/drivers/net/ethernet/intel/i40e/i40e.h +++ b/drivers/net/ethernet/intel/i40e/i40e.h @@ -468,6 +468,8 @@ struct i40e_vsi { #define I40E_VSI_FLAG_VEB_OWNERBIT(1) unsigned long flags; + /* Per VSI lock to protect elements/list (MAC filter) */ + spinlock_t mac_filter_list_lock; struct list_head mac_filter_list; /* VSI stats */ diff --git a/drivers/net/ethernet/intel/i40e/i40e_fcoe.c b/drivers/net/ethernet/intel/i40e/i40e_fcoe.c index 2ec2411..fe5d9bf 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_fcoe.c +++ b/drivers/net/ethernet/intel/i40e/i40e_fcoe.c @@ -1516,10 +1516,12 @@ void i40e_fcoe_config_netdev(struct net_device *netdev, struct i40e_vsi *vsi) * same PCI function. */ netdev->dev_port = 1; + spin_lock_bh(&vsi->mac_filter_list_lock); i40e_add_filter(vsi, hw->mac.san_addr, 0, false, false); i40e_add_filter(vsi, (u8[6]) FC_FCOE_FLOGI_MAC, 0, false, false); i40e_add_filter(vsi, FIP_ALL_FCOE_MACS, 0, false, false); i40e_add_filter(vsi, FIP_ALL_ENODE_MACS, 0, false, false); + spin_unlock_bh(&vsi->mac_filter_list_lock); /* use san mac */ ether_addr_copy(netdev->dev_addr, hw->mac.san_addr); diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c index 87a5d09..aa2aafd 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_main.c +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c @@ -1223,6 +1223,9 @@ static struct i40e_mac_filter *i40e_find_filter(struct i40e_vsi *vsi, { struct i40e_mac_filter *f; + WARN(!spin_is_locked(&vsi->mac_filter_list_lock), +"Missing mac_filter_list_lock\n"); + if (!vsi || !macaddr) return NULL; @@ -1251,6 +1254,9 @@ struct i40e_mac_filter *i40e_find_mac(struct i40e_vsi *vsi, u8 *macaddr, { struct i40e_mac_filter *f; + WARN(!spin_is_locked(&vsi->mac_filter_list_lock), +"Missing mac_filter_list_lock\n"); + if (!vsi || !macaddr) return NULL; @@ -1273,6 +1279,9 @@ bool i40e_is_vsi_in_vlan(struct i40e_vsi *vsi) { struct i40e_mac_filter *f; + WARN(!spin_is_locked(&vsi->mac_filter_list_lock), +"Missing mac_filter_list_lock\n"); + /* Only -1 for all the filters denotes not in vlan mode * so we have to go through all the list in order to make sure */ @@ -1301,6 +1310,9 @@ struct i40e_mac_filter *i40e_put_mac_in_vlan(struct i40e_vsi *vsi, u8 *macaddr, { struct i40e_mac_filter *f; + WARN(!spin_is_locked(&vsi->mac_filter_list_lock), +"Missing mac_filter_list_lock\n"); + list_for_each_entry(f, &vsi->mac_filter_list, list) { if (vsi->info.pvid) f->vlan = le16_to_cpu(vsi->info.pvid); @@ -1355,6 +1367,10 @@ static int i40e_rm_default_mac_filter(struct i40e_vsi *vsi, u8 *macaddr) * @is_netdev: make sure its a netdev filter, else doesn't matter * * Returns ptr to the filter object or NULL when no memory available. + * + * NOTE: This function is expected to be called with mac_filter_list_lock + * being held. If needed could add WARN/BUG_ON if lock is not held for debug + * purpose. **/ struct i40e_mac_filter *i40e_add_filter(struct i40e_vsi *vsi, u8 *macaddr, s16 vlan, @@ -1362,6 +1378,9 @@ struct i40e_mac_filter *i40e_add_filter(struct i40e_vsi *vsi, { struct i40e_mac_filter *f; + WARN(!spin_is_locked(&vsi->mac_filter_list_lock), +"Missing mac_filter_list_lock\n"); + if (!vsi || !macaddr) return NULL; @@ -1413,6 +1432,10 @@ add_filter_out: * @vlan: the vlan * @is_vf: make sure it's a VF filter, else doesn't matter * @is_netdev: make sure it's a netdev filter, else doesn't matter + * + * NOTE: This function is expected to be call
[net-next 17/17] i40e/i40evf: Bump i40e to 1.3.38 and i40evf to 1.3.25
From: Catherine Sullivan Bump. Change-ID: Id0a7ecaa491f88ce94c9eba4901e592a56044ee0 Signed-off-by: Catherine Sullivan Tested-by: Andrew Bowers Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/i40e/i40e_main.c | 2 +- drivers/net/ethernet/intel/i40evf/i40evf_main.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c index 92b9d27..7af52d3 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_main.c +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c @@ -39,7 +39,7 @@ static const char i40e_driver_string[] = #define DRV_VERSION_MAJOR 1 #define DRV_VERSION_MINOR 3 -#define DRV_VERSION_BUILD 34 +#define DRV_VERSION_BUILD 38 #define DRV_VERSION __stringify(DRV_VERSION_MAJOR) "." \ __stringify(DRV_VERSION_MINOR) "." \ __stringify(DRV_VERSION_BUILD)DRV_KERN diff --git a/drivers/net/ethernet/intel/i40evf/i40evf_main.c b/drivers/net/ethernet/intel/i40evf/i40evf_main.c index a44a42b..10b2a4a 100644 --- a/drivers/net/ethernet/intel/i40evf/i40evf_main.c +++ b/drivers/net/ethernet/intel/i40evf/i40evf_main.c @@ -34,7 +34,7 @@ char i40evf_driver_name[] = "i40evf"; static const char i40evf_driver_string[] = "Intel(R) XL710/X710 Virtual Function Network Driver"; -#define DRV_VERSION "1.3.21" +#define DRV_VERSION "1.3.25" const char i40evf_driver_version[] = DRV_VERSION; static const char i40evf_copyright[] = "Copyright (c) 2013 - 2015 Intel Corporation."; -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[net-next 08/17] i40e/i40evf: change dynamic interrupt thresholds
From: Jesse Brandeburg The dynamic algorithm, while now working, doesn't have good performance in 40G mode. One part of this patch addresses the high CPU utilization of some small streaming workloads that the driver should reduce CPU in. It also changes the minimum ITR that the dynamic algorithm will settle on, causing our minimum latency to go from 12us to about 14us, when using adaptive mode. It also changes the BULK interrupt rate to allow maximum throughput on a 40Gb connection with a single thread of transmit, clamping interrupt rate to 8000 for TX makes single thread traffic go too slow. The new ULTRA bulk setting is introduced and is used when the Rx packet rate on this queue exceeds 4 packets per second. This value of 4 was chosen because the automatic tuning of minimum ITR=20us means that a single queue can't quite achieve that many packets per second from a round-robin test. Change-ID: Icce8faa128688ca5fd2c4229bdd9726877a92ea2 Signed-off-by: Jesse Brandeburg Tested-by: Andrew Bowers Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/i40e/i40e_txrx.c | 27 +-- drivers/net/ethernet/intel/i40e/i40e_txrx.h | 3 +++ drivers/net/ethernet/intel/i40evf/i40e_txrx.c | 27 +-- drivers/net/ethernet/intel/i40evf/i40e_txrx.h | 3 +++ 4 files changed, 48 insertions(+), 12 deletions(-) diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c index c96e581..7aea143 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c @@ -828,6 +828,7 @@ void i40e_force_wb(struct i40e_vsi *vsi, struct i40e_q_vector *q_vector) static bool i40e_set_new_dynamic_itr(struct i40e_ring_container *rc) { enum i40e_latency_range new_latency_range = rc->latency_range; + struct i40e_q_vector *qv = rc->ring->q_vector; u32 new_itr = rc->itr; int bytes_per_int; int usecs; @@ -836,9 +837,10 @@ static bool i40e_set_new_dynamic_itr(struct i40e_ring_container *rc) return false; /* simple throttlerate management -* 0-10MB/s lowest (10 ints/s) +* 0-10MB/s lowest (5 ints/s) * 10-20MB/s low(2 ints/s) -* 20-1249MB/s bulk (8000 ints/s) +* 20-1249MB/s bulk (18000 ints/s) +* > 4 Rx packets per second (8000 ints/s) * * The math works out because the divisor is in 10^(-6) which * turns the bytes/us input value into MB/s values, but @@ -859,24 +861,37 @@ static bool i40e_set_new_dynamic_itr(struct i40e_ring_container *rc) new_latency_range = I40E_LOWEST_LATENCY; break; case I40E_BULK_LATENCY: - if (bytes_per_int <= 20) - new_latency_range = I40E_LOW_LATENCY; - break; + case I40E_ULTRA_LATENCY: default: if (bytes_per_int <= 20) new_latency_range = I40E_LOW_LATENCY; break; } + + /* this is to adjust RX more aggressively when streaming small +* packets. The value of 4 was picked as it is just beyond +* what the hardware can receive per second if in low latency +* mode. +*/ +#define RX_ULTRA_PACKET_RATE 4 + + if rc->total_packets * 100) / usecs) > RX_ULTRA_PACKET_RATE) && + (&qv->rx == rc)) + new_latency_range = I40E_ULTRA_LATENCY; + rc->latency_range = new_latency_range; switch (new_latency_range) { case I40E_LOWEST_LATENCY: - new_itr = I40E_ITR_100K; + new_itr = I40E_ITR_50K; break; case I40E_LOW_LATENCY: new_itr = I40E_ITR_20K; break; case I40E_BULK_LATENCY: + new_itr = I40E_ITR_18K; + break; + case I40E_ULTRA_LATENCY: new_itr = I40E_ITR_8K; break; default: diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h index 7c0ed84..0fe7eb7 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h @@ -32,7 +32,9 @@ #define I40E_MAX_ITR 0x0FF0 /* reg uses 2 usec resolution */ #define I40E_MIN_ITR 0x0001 /* reg uses 2 usec resolution */ #define I40E_ITR_100K 0x0005 +#define I40E_ITR_50K 0x000A #define I40E_ITR_20K 0x0019 +#define I40E_ITR_18K 0x001B #define I40E_ITR_8K0x003E #define I40E_ITR_4K0x007A #define I40E_MAX_INTRL 0x3B/* reg uses 4 usec resolution */ @@ -296,6 +298,7 @@ enum i40e_latency_range { I40E_LOWEST_LATENCY = 0, I40E_LOW_LATENCY = 1, I40E_BULK_LATENCY = 2, + I40E_ULTRA_LATENCY = 3, };
[net-next 10/17] i40e: Fix for Tools loopback test failing after driver load
From: Carolyn Wyborny This patch fixes a problem where our Tools MAC Loopback test, after driver unbind would fail. This was because the hw was configured for multiqueue and unbind operation did not clear this configuration. The problem is fixed by resetting this configuration in i40e_remove. Change-ID: I130c05138319182ed1476d3a0b5222d6a6320af9 Signed-off-by: Carolyn Wyborny Tested-by: Andrew Bowers Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/i40e/i40e_main.c | 5 + 1 file changed, 5 insertions(+) diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c index b105da1..9b4fa3e 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_main.c +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c @@ -10650,6 +10650,7 @@ err_dma: static void i40e_remove(struct pci_dev *pdev) { struct i40e_pf *pf = pci_get_drvdata(pdev); + struct i40e_hw *hw = &pf->hw; i40e_status ret_code; int i; @@ -10657,6 +10658,10 @@ static void i40e_remove(struct pci_dev *pdev) i40e_ptp_stop(pf); + /* Disable RSS in hw */ + wr32(hw, I40E_PFQF_HENA(0), 0); + wr32(hw, I40E_PFQF_HENA(1), 0); + /* no more scheduling of any task */ set_bit(__I40E_DOWN, &pf->state); del_timer_sync(&pf->service_timer); -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[net-next 09/17] i40e/i40evf: adjust interrupt throttle less frequently
From: Jesse Brandeburg The adaptive ITR (interrupt throttle rate) algorithm was adjusting the hardware's interrupt rate too frequently. This caused a lot of variation in the interrupt rate for fairly constant workloads. Change the code to have a counter and adjust only once every N number of interrupts. Change-ID: I0460f1f86571037484eca5aca36ac4d889cb8389 Signed-off-by: Jesse Brandeburg Tested-by: Andrew Bowers Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/i40e/i40e.h | 2 ++ drivers/net/ethernet/intel/i40e/i40e_main.c | 2 ++ drivers/net/ethernet/intel/i40e/i40e_txrx.c | 22 -- drivers/net/ethernet/intel/i40e/i40e_txrx.h | 4 ++-- drivers/net/ethernet/intel/i40evf/i40e_txrx.c | 23 +-- drivers/net/ethernet/intel/i40evf/i40e_txrx.h | 4 ++-- drivers/net/ethernet/intel/i40evf/i40evf.h | 2 ++ drivers/net/ethernet/intel/i40evf/i40evf_main.c | 2 ++ 8 files changed, 53 insertions(+), 8 deletions(-) diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h index b7818be..7c2b2e8 100644 --- a/drivers/net/ethernet/intel/i40e/i40e.h +++ b/drivers/net/ethernet/intel/i40e/i40e.h @@ -577,6 +577,8 @@ struct i40e_q_vector { struct rcu_head rcu;/* to avoid race with update stats on free */ char name[I40E_INT_NAME_STR_LEN]; bool arm_wb_state; +#define ITR_COUNTDOWN_START 100 + u8 itr_countdown; /* when 0 should adjust ITR */ } cacheline_internodealigned_in_smp; /* lan device */ diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c index 11dbb29..b105da1 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_main.c +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c @@ -3107,6 +3107,7 @@ static void i40e_vsi_configure_msix(struct i40e_vsi *vsi) for (i = 0; i < vsi->num_q_vectors; i++, vector++) { struct i40e_q_vector *q_vector = vsi->q_vectors[i]; + q_vector->itr_countdown = ITR_COUNTDOWN_START; q_vector->rx.itr = ITR_TO_REG(vsi->rx_itr_setting); q_vector->rx.latency_range = I40E_LOW_LATENCY; wr32(hw, I40E_PFINT_ITRN(I40E_RX_ITR, vector - 1), @@ -3202,6 +3203,7 @@ static void i40e_configure_msi_and_legacy(struct i40e_vsi *vsi) u32 val; /* set the ITR configuration */ + q_vector->itr_countdown = ITR_COUNTDOWN_START; q_vector->rx.itr = ITR_TO_REG(vsi->rx_itr_setting); q_vector->rx.latency_range = I40E_LOW_LATENCY; wr32(hw, I40E_PFINT_ITR0(I40E_RX_ITR), q_vector->rx.itr); diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c index 7aea143..006f0fb 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c @@ -845,10 +845,12 @@ static bool i40e_set_new_dynamic_itr(struct i40e_ring_container *rc) * The math works out because the divisor is in 10^(-6) which * turns the bytes/us input value into MB/s values, but * make sure to use usecs, as the register values written -* are in 2 usec increments in the ITR registers. +* are in 2 usec increments in the ITR registers, and make sure +* to use the smoothed values that the countdown timer gives us. */ - usecs = (rc->itr << 1); + usecs = (rc->itr << 1) * ITR_COUNTDOWN_START; bytes_per_int = rc->total_bytes / usecs; + switch (new_latency_range) { case I40E_LOWEST_LATENCY: if (bytes_per_int > 10) @@ -1806,8 +1808,17 @@ static inline void i40e_update_enable_itr(struct i40e_vsi *vsi, vector = (q_vector->v_idx + vsi->base_vector); + /* avoid dynamic calculation if in countdown mode OR if +* all dynamic is disabled +*/ rxval = txval = i40e_buildreg_itr(I40E_ITR_NONE, 0); + if (q_vector->itr_countdown > 0 || + (!ITR_IS_DYNAMIC(vsi->rx_itr_setting) && +!ITR_IS_DYNAMIC(vsi->tx_itr_setting))) { + goto enable_int; + } + if (ITR_IS_DYNAMIC(vsi->rx_itr_setting)) { rx = i40e_set_new_dynamic_itr(&q_vector->rx); rxval = i40e_buildreg_itr(I40E_RX_ITR, q_vector->rx.itr); @@ -1845,8 +1856,15 @@ static inline void i40e_update_enable_itr(struct i40e_vsi *vsi, wr32(hw, INTREG(vector - 1), rxval); } +enable_int: if (!test_bit(__I40E_DOWN, &vsi->state)) wr32(hw, INTREG(vector - 1), txval); + + if (q_vector->itr_countdown) + q_vector->itr_countdown--; + else + q_vector->itr_countdown = ITR_COUNTDOWN_START; + } /** diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h index 0fe7eb7..6779fb7 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h +++ b/drivers/net/ethernet/
[net-next 02/17] i40evf: fix overlong BIT defines
From: Jesse Brandeburg The defines from the RSS enabling call were mistakenly missed in the patches to the i40e which should have been to i40evf as well. This is a follow up to (commit ed921559886dd40528) "fix 32 bit build warnings". Signed-off-by: Jesse Brandeburg Tested-by: Andrew Bowers Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/i40evf/i40e_common.c | 5 - drivers/net/ethernet/intel/i40evf/i40e_txrx.h | 14 +++--- 2 files changed, 7 insertions(+), 12 deletions(-) diff --git a/drivers/net/ethernet/intel/i40evf/i40e_common.c b/drivers/net/ethernet/intel/i40evf/i40e_common.c index 7435fb3..a4c5a49 100644 --- a/drivers/net/ethernet/intel/i40evf/i40e_common.c +++ b/drivers/net/ethernet/intel/i40evf/i40e_common.c @@ -443,9 +443,6 @@ static i40e_status i40e_aq_get_set_rss_lut(struct i40e_hw *hw, I40E_AQC_SET_RSS_LUT_TABLE_TYPE_SHIFT) & I40E_AQC_SET_RSS_LUT_TABLE_TYPE_MASK)); - cmd_resp->addr_high = cpu_to_le32(high_16_bits((u64)lut)); - cmd_resp->addr_low = cpu_to_le32(lower_32_bits((u64)lut)); - status = i40evf_asq_send_command(hw, &desc, lut, lut_size, NULL); return status; @@ -520,8 +517,6 @@ static i40e_status i40e_aq_get_set_rss_key(struct i40e_hw *hw, I40E_AQC_SET_RSS_KEY_VSI_ID_SHIFT) & I40E_AQC_SET_RSS_KEY_VSI_ID_MASK)); cmd_resp->vsi_id |= cpu_to_le16((u16)I40E_AQC_SET_RSS_KEY_VSI_VALID); - cmd_resp->addr_high = cpu_to_le32(high_16_bits((u64)key)); - cmd_resp->addr_low = cpu_to_le32(lower_32_bits((u64)key)); status = i40evf_asq_send_command(hw, &desc, key, key_size, NULL); diff --git a/drivers/net/ethernet/intel/i40evf/i40e_txrx.h b/drivers/net/ethernet/intel/i40evf/i40e_txrx.h index c4f5a4e..03275eb 100644 --- a/drivers/net/ethernet/intel/i40evf/i40e_txrx.h +++ b/drivers/net/ethernet/intel/i40evf/i40e_txrx.h @@ -89,16 +89,16 @@ enum i40e_dyn_idx_t { BIT_ULL(I40E_FILTER_PCTYPE_L2_PAYLOAD)) #define I40E_DEFAULT_RSS_HENA_EXPANDED (I40E_DEFAULT_RSS_HENA | \ - BIT(I40E_FILTER_PCTYPE_NONF_IPV4_TCP_SYN_NO_ACK) | \ - BIT(I40E_FILTER_PCTYPE_NONF_UNICAST_IPV4_UDP) | \ - BIT(I40E_FILTER_PCTYPE_NONF_MULTICAST_IPV4_UDP) | \ - BIT(I40E_FILTER_PCTYPE_NONF_IPV6_TCP_SYN_NO_ACK) | \ - BIT(I40E_FILTER_PCTYPE_NONF_UNICAST_IPV6_UDP) | \ - BIT(I40E_FILTER_PCTYPE_NONF_MULTICAST_IPV6_UDP)) + BIT_ULL(I40E_FILTER_PCTYPE_NONF_IPV4_TCP_SYN_NO_ACK) | \ + BIT_ULL(I40E_FILTER_PCTYPE_NONF_UNICAST_IPV4_UDP) | \ + BIT_ULL(I40E_FILTER_PCTYPE_NONF_MULTICAST_IPV4_UDP) | \ + BIT_ULL(I40E_FILTER_PCTYPE_NONF_IPV6_TCP_SYN_NO_ACK) | \ + BIT_ULL(I40E_FILTER_PCTYPE_NONF_UNICAST_IPV6_UDP) | \ + BIT_ULL(I40E_FILTER_PCTYPE_NONF_MULTICAST_IPV6_UDP)) #define i40e_pf_get_default_rss_hena(pf) \ (((pf)->flags & I40E_FLAG_MULTIPLE_TCP_UDP_RSS_PCTYPE) ? \ - I40E_DEFAULT_RSS_HENA_EXPANDED : I40E_DEFAULT_RSS_HENA) + I40E_DEFAULT_RSS_HENA_EXPANDED : I40E_DEFAULT_RSS_HENA) /* Supported Rx Buffer Sizes */ #define I40E_RXBUFFER_512 512/* Used for packet split */ -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[net-next 13/17] i40e: Fix order of checks when enabling/disabling autoneg in ethtool
From: Catherine Sullivan We were previously checking if autoneg was allowed to change before checking if autoneg was changing. We need to do this in the other order or else we will erroneously return EINVAL when autoneg is not changing. Change-ID: Iff9f7d1c9bddc1ad1e5d227d4f42754f90155410 Signed-off-by: Catherine Sullivan Tested-by: Andrew Bowers Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/i40e/i40e_ethtool.c | 29 ++ 1 file changed, 16 insertions(+), 13 deletions(-) diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c index 767b1db..6e5f931 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c +++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c @@ -661,28 +661,31 @@ static int i40e_set_settings(struct net_device *netdev, /* Check autoneg */ if (autoneg == AUTONEG_ENABLE) { - /* If autoneg is not supported, return error */ - if (!(safe_ecmd.supported & SUPPORTED_Autoneg)) { - netdev_info(netdev, "Autoneg not supported on this phy\n"); - return -EINVAL; - } /* If autoneg was not already enabled */ if (!(hw->phy.link_info.an_info & I40E_AQ_AN_COMPLETED)) { + /* If autoneg is not supported, return error */ + if (!(safe_ecmd.supported & SUPPORTED_Autoneg)) { + netdev_info(netdev, "Autoneg not supported on this phy\n"); + return -EINVAL; + } + /* Autoneg is allowed to change */ config.abilities = abilities.abilities | I40E_AQ_PHY_ENABLE_AN; change = true; } } else { - /* If autoneg is supported 10GBASE_T is the only phy that -* can disable it, so otherwise return error -*/ - if (safe_ecmd.supported & SUPPORTED_Autoneg && - hw->phy.link_info.phy_type != I40E_PHY_TYPE_10GBASE_T) { - netdev_info(netdev, "Autoneg cannot be disabled on this phy\n"); - return -EINVAL; - } /* If autoneg is currently enabled */ if (hw->phy.link_info.an_info & I40E_AQ_AN_COMPLETED) { + /* If autoneg is supported 10GBASE_T is the only PHY +* that can disable it, so otherwise return error +*/ + if (safe_ecmd.supported & SUPPORTED_Autoneg && + hw->phy.link_info.phy_type != + I40E_PHY_TYPE_10GBASE_T) { + netdev_info(netdev, "Autoneg cannot be disabled on this phy\n"); + return -EINVAL; + } + /* Autoneg is allowed to change */ config.abilities = abilities.abilities & ~I40E_AQ_PHY_ENABLE_AN; change = true; -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[net-next 04/17] i40e/i40evf: use napi_schedule_irqoff()
From: Alexander Duyck The i40e_intr and i40e/i40evf_msix_clean_rings functions run from hard interrupt context or with interrupts already disabled in netpoll. They can use napi_schedule_irqoff() instead of napi_schedule() Signed-off-by: Alexander Duyck Tested-by: Andrew Bowers Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/i40e/i40e_main.c | 6 -- drivers/net/ethernet/intel/i40evf/i40evf_main.c | 2 +- 2 files changed, 5 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c index aa2aafd..11dbb29 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_main.c +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c @@ -3287,7 +3287,7 @@ static irqreturn_t i40e_msix_clean_rings(int irq, void *data) if (!q_vector->tx.ring && !q_vector->rx.ring) return IRQ_HANDLED; - napi_schedule(&q_vector->napi); + napi_schedule_irqoff(&q_vector->napi); return IRQ_HANDLED; } @@ -3456,6 +3456,8 @@ static irqreturn_t i40e_intr(int irq, void *data) /* only q0 is used in MSI/Legacy mode, and none are used in MSIX */ if (icr0 & I40E_PFINT_ICR0_QUEUE_0_MASK) { + struct i40e_vsi *vsi = pf->vsi[pf->lan_vsi]; + struct i40e_q_vector *q_vector = vsi->q_vectors[0]; /* temporarily disable queue cause for NAPI processing */ u32 qval = rd32(hw, I40E_QINT_RQCTL(0)); @@ -3468,7 +3470,7 @@ static irqreturn_t i40e_intr(int irq, void *data) wr32(hw, I40E_QINT_TQCTL(0), qval); if (!test_bit(__I40E_DOWN, &pf->state)) - napi_schedule(&pf->vsi[pf->lan_vsi]->q_vectors[0]->napi); + napi_schedule_irqoff(&q_vector->napi); } if (icr0 & I40E_PFINT_ICR0_ADMINQ_MASK) { diff --git a/drivers/net/ethernet/intel/i40evf/i40evf_main.c b/drivers/net/ethernet/intel/i40evf/i40evf_main.c index 4c4340c..1dd5245 100644 --- a/drivers/net/ethernet/intel/i40evf/i40evf_main.c +++ b/drivers/net/ethernet/intel/i40evf/i40evf_main.c @@ -334,7 +334,7 @@ static irqreturn_t i40evf_msix_clean_rings(int irq, void *data) if (!q_vector->tx.ring && !q_vector->rx.ring) return IRQ_HANDLED; - napi_schedule(&q_vector->napi); + napi_schedule_irqoff(&q_vector->napi); return IRQ_HANDLED; } -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[net-next 11/17] i40e: fix for PHY NVM interaction problem
From: Carolyn Wyborny This patch fixes a problem where the NVMUpdate Tool, when using the PHY NVM feature, gets bad data from the PHY because of contention on the MDIO interface from get PHY capability calls from the driver during regular operations. The problem is fixed by adding a check if media is available before calling get PHY capability function because that bit is not set when device is in PHY interaction mode. Change-ID: Ib89991b0f841808dd92410f5e8683d6ee3301cd0 Signed-off-by: Carolyn Wyborny Tested-by: Andrew Bowers Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/i40e/i40e_common.c | 14 -- 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/drivers/net/ethernet/intel/i40e/i40e_common.c b/drivers/net/ethernet/intel/i40e/i40e_common.c index 7f456e9..c51f2fb 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_common.c +++ b/drivers/net/ethernet/intel/i40e/i40e_common.c @@ -2278,13 +2278,15 @@ i40e_status i40e_update_link_info(struct i40e_hw *hw) if (status) return status; - status = i40e_aq_get_phy_capabilities(hw, false, false, &abilities, - NULL); - if (status) - return status; + if (hw->phy.link_info.link_info & I40E_AQ_MEDIA_AVAILABLE) { + status = i40e_aq_get_phy_capabilities(hw, false, false, + &abilities, NULL); + if (status) + return status; - memcpy(hw->phy.link_info.module_type, &abilities.module_type, - sizeof(hw->phy.link_info.module_type)); + memcpy(hw->phy.link_info.module_type, &abilities.module_type, + sizeof(hw->phy.link_info.module_type)); + } return status; } -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
pull-request: wireless-drivers 2015-10-17
Hi Dave, few small fixes I would like to get to 4.3 still. Please let me know if there are any problems. Kalle The following changes since commit 76d164f582150fd0259ec0fcbc485470bcd8033e: ath10k: fix DMA related firmware crashes on multiple devices (2015-09-26 20:48:25 +0300) are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers.git tags/wireless-drivers-for-davem-2015-10-17 for you to fetch changes up to de28a05ee28e36fa1fc68fcfc7c6cc1c6f7c270c: Merge tag 'iwlwifi-for-kalle-2015-10-05' of git://git.kernel.org/pub/scm/linux/kernel/git/iwlwifi/iwlwifi-fixes (2015-10-07 11:12:01 +0300) iwlwifi: * mvm: flush fw_dump_wk when mvm fails to start * mvm: init card correctly on ctkill exit check * pci: add a few more PCI subvendor IDs for the 7265 series * fix firmware filename for 3160 * mvm: clear csa countdown when AP is stopped * mvm: fix D3 firmware PN programming * dvm: fix D3 firmware PN programming * mvm: fix D3 CCMP TX PN assignment rtlwifi: * rtl8821ae: Fix system lockups on boot Andrei Otcheretianski (1): iwlwifi: mvm: flush fw_dump_wk when mvm fails to start Arik Nemtsov (1): iwlwifi: mvm: init card correctly on ctkill exit check Avraham Stern (1): iwlwifi: mvm: clear csa countdown when AP is stopped Johannes Berg (4): iwlwifi: mvm: fix D3 CCMP TX PN assignment iwlwifi: dvm: fix D3 firmware PN programming iwlwifi: mvm: fix D3 firmware PN programming iwlwifi: fix firmware filename for 3160 Kalle Valo (1): Merge tag 'iwlwifi-for-kalle-2015-10-05' of git://git.kernel.org/.../iwlwifi/iwlwifi-fixes Larry Finger (1): rtlwifi: rtl8821ae: Fix system lockups on boot Luca Coelho (1): iwlwifi: pci: add a few more PCI subvendor IDs for the 7265 series drivers/net/wireless/iwlwifi/dvm/lib.c |2 +- drivers/net/wireless/iwlwifi/iwl-7000.c |2 +- drivers/net/wireless/iwlwifi/mvm/d3.c | 27 +++ drivers/net/wireless/iwlwifi/mvm/fw.c |4 ++-- drivers/net/wireless/iwlwifi/mvm/mac80211.c |1 + drivers/net/wireless/iwlwifi/mvm/mvm.h |5 + drivers/net/wireless/iwlwifi/mvm/ops.c |1 + drivers/net/wireless/iwlwifi/pcie/drv.c |5 + drivers/net/wireless/rtlwifi/pci.h |2 ++ drivers/net/wireless/rtlwifi/rtl8821ae/hw.c | 17 + drivers/net/wireless/rtlwifi/rtl8821ae/sw.c |5 + drivers/net/wireless/rtlwifi/wifi.h |3 +++ 12 files changed, 54 insertions(+), 20 deletions(-) -- Kalle Valo -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-next] uapi: add mpls_iptunnel.h
On 10/16/15, 4:12 PM, Stephen Hemminger wrote: > Add missing rule to export mpls iptunnel header needed by iproute2 > > Signed-off-by: Stephen Hemminger Acked-by: Roopa Prabhu Thanks Stephen. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net-next 5/7] tcp: skb_mstamp_after helper
a helper to prepare the first main RACK patch. Signed-off-by: Yuchung Cheng Signed-off-by: Neal Cardwell Signed-off-by: Eric Dumazet --- include/linux/skbuff.h | 9 + 1 file changed, 9 insertions(+) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 4398411..24f4dfd 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -463,6 +463,15 @@ static inline u32 skb_mstamp_us_delta(const struct skb_mstamp *t1, return delta_us; } +static inline bool skb_mstamp_after(const struct skb_mstamp *t1, + const struct skb_mstamp *t0) +{ + s32 diff = t1->stamp_jiffies - t0->stamp_jiffies; + + if (!diff) + diff = t1->stamp_us - t0->stamp_us; + return diff > 0; +} /** * struct sk_buff - socket buffer -- 2.6.0.rc2.230.g3dd15c0 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net-next 4/7] tcp: add tcp_tsopt_ecr_before helper
a helper to prepare the main RACK patch Signed-off-by: Yuchung Cheng Signed-off-by: Neal Cardwell Signed-off-by: Eric Dumazet --- net/ipv4/tcp_input.c | 9 +++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 5ad08d8..c304b5f 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -2250,14 +2250,19 @@ static inline void tcp_moderate_cwnd(struct tcp_sock *tp) tp->snd_cwnd_stamp = tcp_time_stamp; } +static bool tcp_tsopt_ecr_before(const struct tcp_sock *tp, u32 when) +{ + return tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr && + before(tp->rx_opt.rcv_tsecr, when); +} + /* Nothing was retransmitted or returned timestamp is less * than timestamp of the first retransmission. */ static inline bool tcp_packet_delayed(const struct tcp_sock *tp) { return !tp->retrans_stamp || - (tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr && -before(tp->rx_opt.rcv_tsecr, tp->retrans_stamp)); + tcp_tsopt_ecr_before(tp, tp->retrans_stamp); } /* Undo procedures. */ -- 2.6.0.rc2.230.g3dd15c0 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net-next 6/7] tcp: track the packet timings in RACK
This patch is the first half of the RACK loss recovery. RACK loss recovery uses the notion of time instead of packet sequence (FACK) or counts (dupthresh). It's inspired by the previous FACK heuristic in tcp_mark_lost_retrans(): when a limited transmit (new data packet) is sacked, then current retransmitted sequence below the newly sacked sequence must been lost, since at least one round trip time has elapsed. But it has several limitations: 1) can't detect tail drops since it depends on limited transmit 2) is disabled upon reordering (assumes no reordering) 3) only enabled in fast recovery ut not timeout recovery RACK (Recently ACK) addresses these limitations with the notion of time instead: a packet P1 is lost if a later packet P2 is s/acked, as at least one round trip has passed. Since RACK cares about the time sequence instead of the data sequence of packets, it can detect tail drops when later retransmission is s/acked while FACK or dupthresh can't. For reordering RACK uses a dynamically adjusted reordering window ("reo_wnd") to reduce false positives on ever (small) degree of reordering. This patch implements tcp_advanced_rack() which tracks the most recent transmission time among the packets that have been delivered (ACKed or SACKed) in tp->rack.mstamp. This timestamp is the key to determine which packet has been lost. Consider an example that the sender sends six packets: T1: P1 (lost) T2: P2 T3: P3 T4: P4 T100: sack of P2. rack.mstamp = T2 T101: retransmit P1 T102: sack of P2,P3,P4. rack.mstamp = T4 T205: ACK of P4 since the hole is repaired. rack.mstamp = T101 We need to be careful about spurious retransmission because it may falsely advance tp->rack.mstamp by an RTT or an RTO, causing RACK to falsely mark all packets lost, just like a spurious timeout. We identify spurious retransmission by the ACK's TS echo value. If TS option is not applicable but the retransmission is acknowledged less than min-RTT ago, it is likely to be spurious. We refrain from using the transmission time of these spurious retransmissions. The second half is implemented in the next patch that marks packet lost using RACK timestamp. Signed-off-by: Yuchung Cheng Signed-off-by: Neal Cardwell Signed-off-by: Eric Dumazet --- include/linux/tcp.h | 6 ++ include/net/tcp.h| 5 + net/ipv4/Makefile| 1 + net/ipv4/tcp_input.c | 14 ++ net/ipv4/tcp_minisocks.c | 2 ++ net/ipv4/tcp_recovery.c | 32 6 files changed, 60 insertions(+) create mode 100644 net/ipv4/tcp_recovery.c diff --git a/include/linux/tcp.h b/include/linux/tcp.h index 8c54863..5dce970 100644 --- a/include/linux/tcp.h +++ b/include/linux/tcp.h @@ -194,6 +194,12 @@ struct tcp_sock { u32 window_clamp; /* Maximal window to advertise */ u32 rcv_ssthresh; /* Current window clamp */ + /* Information of the most recently (s)acked skb */ + struct tcp_rack { + struct skb_mstamp mstamp; /* (Re)sent time of the skb */ + u8 advanced; /* mstamp advanced since last lost marking */ + u8 reord;/* reordering detected */ + } rack; u16 advmss; /* Advertised MSS */ u8 unused; u8 nonagle : 4,/* Disable Nagle algorithm? */ diff --git a/include/net/tcp.h b/include/net/tcp.h index 4575f0e..aee5f23 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -1750,6 +1750,11 @@ int tcpv4_offload_init(void); void tcp_v4_init(void); void tcp_init(void); +/* tcp_recovery.c */ + +extern void tcp_rack_advance(struct tcp_sock *tp, +const struct skb_mstamp *xmit_time, u8 sacked); + /* * Save and compile IPv4 options, return a pointer to it */ diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile index 89aacb6..c29809f 100644 --- a/net/ipv4/Makefile +++ b/net/ipv4/Makefile @@ -8,6 +8,7 @@ obj-y := route.o inetpeer.o protocol.o \ inet_timewait_sock.o inet_connection_sock.o \ tcp.o tcp_input.o tcp_output.o tcp_timer.o tcp_ipv4.o \ tcp_minisocks.o tcp_cong.o tcp_metrics.o tcp_fastopen.o \ +tcp_recovery.o \ tcp_offload.o datagram.o raw.o udp.o udplite.o \ udp_offload.o arp.o icmp.o devinet.o af_inet.o igmp.o \ fib_frontend.o fib_semantics.o fib_trie.o \ diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index c304b5f..21a9ea4 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1173,6 +1173,8 @@ static u8 tcp_sacktag_one(struct sock *sk, return sacked; if (!(sacked & TCPCB_SACKED_ACKED)) { + tcp_rack_advance(tp, xmit_time, sacked); + if (sacked & TCPCB_SACKED_RETRANS) { /* If the segment is not tagged as lost, * we do not clear RETRANS, believing @@ -2256,6 +2258,16
[PATCH net-next 2/7] tcp: track min RTT using windowed min-filter
Kathleen Nichols' algorithm for tracking the minimum RTT of a data stream over some measurement window. It uses constant space and constant time per update. Yet it almost always delivers the same minimum as an implementation that has to keep all the data in the window. The measurement window is tunable via sysctl.net.ipv4.tcp_min_rtt_wlen with a default value of 5 minutes. The algorithm keeps track of the best, 2nd best & 3rd best min values, maintaining an invariant that the measurement time of the n'th best >= n-1'th best. It also makes sure that the three values are widely separated in the time window since that bounds the worse case error when that data is monotonically increasing over the window. Upon getting a new min, we can forget everything earlier because it has no value - the new min is less than everything else in the window by definition and it's the most recent. So we restart fresh on every new min and overwrites the 2nd & 3rd choices. The same property holds for the 2nd & 3rd best. Therefore we have to maintain two invariants to maximize the information in the samples, one on values (1st.v <= 2nd.v <= 3rd.v) and the other on times (now-win <=1st.t <= 2nd.t <= 3rd.t <= now). These invariants determine the structure of the code The RTT input to the windowed filter is the minimum RTT measured from ACK or SACK, or as the last resort from TCP timestamps. The accessor tcp_min_rtt() returns the minimum RTT seen in the window. ~0U indicates it is not available. The minimum is 1usec even if the true RTT is below that. Signed-off-by: Yuchung Cheng Signed-off-by: Neal Cardwell Signed-off-by: Eric Dumazet --- Documentation/networking/ip-sysctl.txt | 8 include/linux/tcp.h| 3 ++ include/net/tcp.h | 7 +++ net/ipv4/sysctl_net_ipv4.c | 7 +++ net/ipv4/tcp.c | 1 + net/ipv4/tcp_input.c | 78 +++--- net/ipv4/tcp_minisocks.c | 1 + 7 files changed, 100 insertions(+), 5 deletions(-) diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index ebe94f2..502d6a5 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -384,6 +384,14 @@ tcp_mem - vector of 3 INTEGERs: min, pressure, max Defaults are calculated at boot time from amount of available memory. +tcp_min_rtt_wlen - INTEGER + The window length of the windowed min filter to track the minimum RTT. + A shorter window lets a flow more quickly pick up new (higher) + minimum RTT when it is moved to a longer path (e.g., due to traffic + engineering). A longer window makes the filter more resistant to RTT + inflations such as transient congestion. The unit is seconds. + Default: 300 + tcp_moderate_rcvbuf - BOOLEAN If set, TCP performs receive buffer auto-tuning, attempting to automatically size the buffer (no greater than tcp_rmem[2]) to diff --git a/include/linux/tcp.h b/include/linux/tcp.h index 86a7eda..90edef5 100644 --- a/include/linux/tcp.h +++ b/include/linux/tcp.h @@ -217,6 +217,9 @@ struct tcp_sock { u32 mdev_max_us;/* maximal mdev for the last rtt period */ u32 rttvar_us; /* smoothed mdev_max*/ u32 rtt_seq;/* sequence number to update rttvar */ + struct rtt_meas { + u32 rtt, ts;/* RTT in usec and sampling time in jiffies. */ + } rtt_min[3]; u32 packets_out;/* Packets which are "in flight"*/ u32 retrans_out;/* Retransmitted packets out*/ diff --git a/include/net/tcp.h b/include/net/tcp.h index a6be56d..4575f0e 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -279,6 +279,7 @@ extern int sysctl_tcp_limit_output_bytes; extern int sysctl_tcp_challenge_ack_limit; extern unsigned int sysctl_tcp_notsent_lowat; extern int sysctl_tcp_min_tso_segs; +extern int sysctl_tcp_min_rtt_wlen; extern int sysctl_tcp_autocorking; extern int sysctl_tcp_invalid_ratelimit; extern int sysctl_tcp_pacing_ss_ratio; @@ -671,6 +672,12 @@ static inline bool tcp_ca_dst_locked(const struct dst_entry *dst) return dst_metric_locked(dst, RTAX_CC_ALGO); } +/* Minimum RTT in usec. ~0 means not available. */ +static inline u32 tcp_min_rtt(const struct tcp_sock *tp) +{ + return tp->rtt_min[0].rtt; +} + /* Compute the actual receive window we are currently advertising. * Rcv_nxt can be after the window if our peer push more data * than the offered window. diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c index 894da3a..13ab434 100644 --- a/net/ipv4/sysctl_net_ipv4.c +++ b/net/ipv4/sysctl_net_ipv4.c @@ -577,6 +577,13 @@ static struct ctl_table ipv4_table[] = { .proc_handler = proc_dointvec }, { + .procname = "tcp_min_rtt_
[PATCH net-next 7/7] tcp: use RACK to detect losses
This patch implements the second half of RACK that uses the the most recent transmit time among all delivered packets to detect losses. tcp_rack_mark_lost() is called upon receiving a dubious ACK. It then checks if an not-yet-sacked packet was sent at least "reo_wnd" prior to the sent time of the most recently delivered. If so the packet is deemed lost. The "reo_wnd" reordering window starts with 1msec for fast loss detection and changes to min-RTT/4 when reordering is observed. We found 1msec accommodates well on tiny degree of reordering (<3 pkts) on faster links. We use min-RTT instead of SRTT because reordering is more of a path property but SRTT can be inflated by self-inflicated congestion. The factor of 4 is borrowed from the delayed early retransmit and seems to work reasonably well. Since RACK is still experimental, it is now used as a supplemental loss detection on top of existing algorithms. It is only effective after the fast recovery starts or after the timeout occurs. The fast recovery is still triggered by FACK and/or dupack threshold instead of RACK. We introduce a new sysctl net.ipv4.tcp_recovery for future experiments of loss recoveries. For now RACK can be disabled by setting it to 0. Signed-off-by: Yuchung Cheng Signed-off-by: Neal Cardwell Signed-off-by: Eric Dumazet --- Documentation/networking/ip-sysctl.txt | 9 include/net/tcp.h | 9 net/ipv4/sysctl_net_ipv4.c | 7 net/ipv4/tcp_input.c | 9 +++- net/ipv4/tcp_recovery.c| 77 ++ 5 files changed, 109 insertions(+), 2 deletions(-) diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index 502d6a5..85752c8 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -433,6 +433,15 @@ tcp_orphan_retries - INTEGER you should think about lowering this value, such sockets may consume significant resources. Cf. tcp_max_orphans. +tcp_recovery - INTEGER + This value is a bitmap to enable various experimental loss recovery + features. + + RACK: 0x1 enables the RACK loss detection for fast detection of lost + retransmissions and tail drops. + + Default: 0x1 + tcp_reordering - INTEGER Initial reordering level of packets in a TCP stream. TCP stack can then dynamically adjust flow reordering level diff --git a/include/net/tcp.h b/include/net/tcp.h index aee5f23..f615789 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -567,6 +567,7 @@ void tcp_resume_early_retransmit(struct sock *sk); void tcp_rearm_rto(struct sock *sk); void tcp_synack_rtt_meas(struct sock *sk, struct request_sock *req); void tcp_reset(struct sock *sk); +void tcp_skb_mark_lost_uncond_verify(struct tcp_sock *tp, struct sk_buff *skb); /* tcp_timer.c */ void tcp_init_xmit_timers(struct sock *); @@ -1752,6 +1753,14 @@ void tcp_init(void); /* tcp_recovery.c */ +/* Flags to enable various loss recovery features. See below */ +extern int sysctl_tcp_recovery; + +/* Use TCP RACK to detect (some) tail and retransmit losses */ +#define TCP_RACK_LOST_RETRANS 0x1 + +extern int tcp_rack_mark_lost(struct sock *sk); + extern void tcp_rack_advance(struct tcp_sock *tp, const struct skb_mstamp *xmit_time, u8 sacked); diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c index 13ab434..25300c5 100644 --- a/net/ipv4/sysctl_net_ipv4.c +++ b/net/ipv4/sysctl_net_ipv4.c @@ -496,6 +496,13 @@ static struct ctl_table ipv4_table[] = { .proc_handler = proc_dointvec }, { + .procname = "tcp_recovery", + .data = &sysctl_tcp_recovery, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_dointvec, + }, + { .procname = "tcp_reordering", .data = &sysctl_tcp_reordering, .maxlen = sizeof(int), diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 21a9ea4..cf0f954 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -881,6 +881,7 @@ static void tcp_update_reordering(struct sock *sk, const int metric, if (metric > 0) tcp_disable_early_retrans(tp); + tp->rack.reord = 1; } /* This must be called before lost_out is incremented */ @@ -906,8 +907,7 @@ static void tcp_skb_mark_lost(struct tcp_sock *tp, struct sk_buff *skb) } } -static void tcp_skb_mark_lost_uncond_verify(struct tcp_sock *tp, - struct sk_buff *skb) +void tcp_skb_mark_lost_uncond_verify(struct tcp_sock *tp, struct sk_buff *skb) { tcp_verify_retransmit_hint(tp, skb); @@ -2806,6 +2806,11 @@ static void tcp_fastretrans_alert(struct sock *sk, const int acked,
[PATCH net-next 0/7] RACK loss detection
RACK (Recent ACK) loss recovery uses the notion of time instead of packet sequence (FACK) or counts (dupthresh). It's inspired by the FACK heuristic in tcp_mark_lost_retrans(): when a limited transmit (new data packet) is sacked in recovery, then any retransmission sent before that newly sacked packet was sent must have been lost, since at least one round trip time has elapsed. But that existing heuristic from tcp_mark_lost_retrans() has several limitations: 1) it can't detect tail drops since it depends on limited transmit 2) it's disabled upon reordering (assumes no reordering) 3) it's only enabled in fast recovery but not timeout recovery RACK addresses these limitations with a core idea: an unacknowledged packet P1 is deemed lost if a packet P2 that was sent later is is s/acked, since at least one round trip has passed. Since RACK cares about the time sequence instead of the data sequence of packets, it can detect tail drops when a later retransmission is s/acked, while FACK or dupthresh can't. For reordering RACK uses a dynamically adjusted reordering window ("reo_wnd") to reduce false positives on ever (small) degree of reordering, similar to the delayed Early Retransmit. In the current patch set RACK is only a supplemental loss detection and does not trigger fast recovery. However we are developing RACK to replace or consolidate FACK/dupthresh, early retransmit, and thin-dupack. These heuristics all implicitly bear the time notion. For example, the delayed Early Retransmit is simply applying RACK to trigger the fast recovery with small inflight. RACK requires measuring the minimum RTT. Tracking a global min is less robust due to traffic engineering pathing changes. Therefore it uses a windowed filter by Kathleen Nichols. The min RTT can also be useful for various other purposes like congestion control or stat monitoring. This patch has been used on Google servers for well over 1 year. RACK has also been implemented in the QUIC protocol. We are submitting an IETF draft as well. Yuchung Cheng (7): tcp: apply Kern's check on RTTs used for congestion control tcp: track min RTT using windowed min-filter tcp: remove tcp_mark_lost_retrans() tcp: add tcp_tsopt_ecr_before helper tcp: skb_mstamp_after helper tcp: track the packet timings in RACK tcp: use RACK to detect losses Documentation/networking/ip-sysctl.txt | 17 include/linux/skbuff.h | 9 ++ include/linux/tcp.h| 11 +- include/net/tcp.h | 21 net/ipv4/Makefile | 1 + net/ipv4/sysctl_net_ipv4.c | 14 +++ net/ipv4/tcp.c | 1 + net/ipv4/tcp_input.c | 180 +++-- net/ipv4/tcp_minisocks.c | 3 + net/ipv4/tcp_output.c | 6 -- net/ipv4/tcp_recovery.c| 109 11 files changed, 286 insertions(+), 86 deletions(-) create mode 100644 net/ipv4/tcp_recovery.c -- 2.6.0.rc2.230.g3dd15c0 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net-next 3/7] tcp: remove tcp_mark_lost_retrans()
Remove the existing lost retransmit detection because RACK subsumes it completely. This also stops the overloading the ack_seq field of the skb control block. Signed-off-by: Yuchung Cheng Signed-off-by: Neal Cardwell Signed-off-by: Eric Dumazet --- include/linux/tcp.h | 2 -- net/ipv4/tcp_input.c | 65 --- net/ipv4/tcp_output.c | 6 - 3 files changed, 73 deletions(-) diff --git a/include/linux/tcp.h b/include/linux/tcp.h index 90edef5..8c54863 100644 --- a/include/linux/tcp.h +++ b/include/linux/tcp.h @@ -283,8 +283,6 @@ struct tcp_sock { int lost_cnt_hint; u32 retransmit_high;/* L-bits may be on up to this seqno */ - u32 lost_retrans_low; /* Sent seq after any rxmit (lowest) */ - u32 prior_ssthresh; /* ssthresh saved at recovery start */ u32 high_seq; /* snd_nxt at onset of congestion */ diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index e177386..5ad08d8 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1048,70 +1048,6 @@ static bool tcp_is_sackblock_valid(struct tcp_sock *tp, bool is_dsack, return !before(start_seq, end_seq - tp->max_window); } -/* Check for lost retransmit. This superb idea is borrowed from "ratehalving". - * Event "B". Later note: FACK people cheated me again 8), we have to account - * for reordering! Ugly, but should help. - * - * Search retransmitted skbs from write_queue that were sent when snd_nxt was - * less than what is now known to be received by the other end (derived from - * highest SACK block). Also calculate the lowest snd_nxt among the remaining - * retransmitted skbs to avoid some costly processing per ACKs. - */ -static void tcp_mark_lost_retrans(struct sock *sk, int *flag) -{ - const struct inet_connection_sock *icsk = inet_csk(sk); - struct tcp_sock *tp = tcp_sk(sk); - struct sk_buff *skb; - int cnt = 0; - u32 new_low_seq = tp->snd_nxt; - u32 received_upto = tcp_highest_sack_seq(tp); - - if (!tcp_is_fack(tp) || !tp->retrans_out || - !after(received_upto, tp->lost_retrans_low) || - icsk->icsk_ca_state != TCP_CA_Recovery) - return; - - tcp_for_write_queue(skb, sk) { - u32 ack_seq = TCP_SKB_CB(skb)->ack_seq; - - if (skb == tcp_send_head(sk)) - break; - if (cnt == tp->retrans_out) - break; - if (!after(TCP_SKB_CB(skb)->end_seq, tp->snd_una)) - continue; - - if (!(TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_RETRANS)) - continue; - - /* TODO: We would like to get rid of tcp_is_fack(tp) only -* constraint here (see above) but figuring out that at -* least tp->reordering SACK blocks reside between ack_seq -* and received_upto is not easy task to do cheaply with -* the available datastructures. -* -* Whether FACK should check here for tp->reordering segs -* in-between one could argue for either way (it would be -* rather simple to implement as we could count fack_count -* during the walk and do tp->fackets_out - fack_count). -*/ - if (after(received_upto, ack_seq)) { - TCP_SKB_CB(skb)->sacked &= ~TCPCB_SACKED_RETRANS; - tp->retrans_out -= tcp_skb_pcount(skb); - *flag |= FLAG_LOST_RETRANS; - tcp_skb_mark_lost_uncond_verify(tp, skb); - NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPLOSTRETRANSMIT); - } else { - if (before(ack_seq, new_low_seq)) - new_low_seq = ack_seq; - cnt += tcp_skb_pcount(skb); - } - } - - if (tp->retrans_out) - tp->lost_retrans_low = new_low_seq; -} - static bool tcp_check_dsack(struct sock *sk, const struct sk_buff *ack_skb, struct tcp_sack_block_wire *sp, int num_sacks, u32 prior_snd_una) @@ -1838,7 +1774,6 @@ advance_sp: ((inet_csk(sk)->icsk_ca_state != TCP_CA_Loss) || tp->undo_marker)) tcp_update_reordering(sk, tp->fackets_out - state->reord, 0); - tcp_mark_lost_retrans(sk, &state->flag); tcp_verify_left_out(tp); out: diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 6e79fcb..49730e5 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -2655,8 +2655,6 @@ int tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb) net_dbg_ratelimited("retrans_out leaked\n"); } #endif - if (!tp->retrans_out) - tp->lost_retrans_l
[PATCH net-next 1/7] tcp: apply Kern's check on RTTs used for congestion control
Currently ca_seq_rtt_us does not use Kern's check. Fix that by checking if any packet acked is a retransmit, for both RTT used for RTT estimation and congestion control. Fixes: 5b08e47ca ("tcp: prefer packet timing to TS-ECR for RTT") Signed-off-by: Yuchung Cheng Signed-off-by: Neal Cardwell Signed-off-by: Eric Dumazet --- net/ipv4/tcp_input.c | 5 + 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 3b35c3f..38743e5 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -2925,9 +2925,6 @@ static inline bool tcp_ack_update_rtt(struct sock *sk, const int flag, * Karn's algorithm forbids taking RTT if some retransmitted data * is acked (RFC6298). */ - if (flag & FLAG_RETRANS_DATA_ACKED) - seq_rtt_us = -1L; - if (seq_rtt_us < 0) seq_rtt_us = sack_rtt_us; @@ -3169,7 +3166,7 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets, flag |= FLAG_SACK_RENEGING; skb_mstamp_get(&now); - if (likely(first_ackt.v64)) { + if (likely(first_ackt.v64) && !(flag & FLAG_RETRANS_DATA_ACKED)) { seq_rtt_us = skb_mstamp_us_delta(&now, &first_ackt); ca_rtt_us = skb_mstamp_us_delta(&now, &last_ackt); } -- 2.6.0.rc2.230.g3dd15c0 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 1/2] drivers: net: cpsw: add phy-handle parsing
add the ability to parse "phy-handle". This is needed for phys, which have a DT node, and need to parse DT properties. Signed-off-by: Heiko Schocher --- Changes in v2: None Documentation/devicetree/bindings/net/cpsw.txt | 1 + drivers/net/ethernet/ti/cpsw.c | 15 +++ 2 files changed, 12 insertions(+), 4 deletions(-) diff --git a/Documentation/devicetree/bindings/net/cpsw.txt b/Documentation/devicetree/bindings/net/cpsw.txt index a9df21a..a2cae4e 100644 --- a/Documentation/devicetree/bindings/net/cpsw.txt +++ b/Documentation/devicetree/bindings/net/cpsw.txt @@ -39,6 +39,7 @@ Required properties: Optional properties: - dual_emac_res_vlan : Specifies VID to be used to segregate the ports - mac-address : See ethernet.txt file in the same directory +- phy-handle : See ethernet.txt file in the same directory Note: "ti,hwmods" field is used to fetch the base address and irq resources from TI, omap hwmod data base during device registration. diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c index 8fc90f1..874fb29 100644 --- a/drivers/net/ethernet/ti/cpsw.c +++ b/drivers/net/ethernet/ti/cpsw.c @@ -30,6 +30,7 @@ #include #include #include +#include #include #include #include @@ -365,6 +366,7 @@ struct cpsw_priv { spinlock_t lock; struct platform_device *pdev; struct net_device *ndev; + struct device_node *phy_node; struct napi_struct napi_rx; struct napi_struct napi_tx; struct device *dev; @@ -1145,7 +1147,11 @@ static void cpsw_slave_open(struct cpsw_slave *slave, struct cpsw_priv *priv) cpsw_ale_add_mcast(priv->ale, priv->ndev->broadcast, 1 << slave_port, 0, 0, ALE_MCAST_FWD_2); - slave->phy = phy_connect(priv->ndev, slave->data->phy_id, + if (priv->phy_node) + slave->phy = of_phy_connect(priv->ndev, priv->phy_node, +&cpsw_adjust_link, 0, slave->data->phy_if); + else + slave->phy = phy_connect(priv->ndev, slave->data->phy_id, &cpsw_adjust_link, slave->data->phy_if); if (IS_ERR(slave->phy)) { dev_err(priv->dev, "phy %s not found on slave %d\n", @@ -1934,11 +1940,12 @@ static void cpsw_slave_init(struct cpsw_slave *slave, struct cpsw_priv *priv, slave->port_vlan = data->dual_emac_res_vlan; } -static int cpsw_probe_dt(struct cpsw_platform_data *data, +static int cpsw_probe_dt(struct cpsw_priv *priv, struct platform_device *pdev) { struct device_node *node = pdev->dev.of_node; struct device_node *slave_node; + struct cpsw_platform_data *data = &priv->data; int i = 0, ret; u32 prop; @@ -2029,6 +2036,7 @@ static int cpsw_probe_dt(struct cpsw_platform_data *data, if (strcmp(slave_node->name, "slave")) continue; + priv->phy_node = of_parse_phandle(slave_node, "phy-handle", 0); parp = of_get_property(slave_node, "phy_id", &lenp); if ((parp == NULL) || (lenp != (sizeof(void *) * 2))) { dev_err(&pdev->dev, "Missing slave[%d] phy_id property\n", i); @@ -2044,7 +2052,6 @@ static int cpsw_probe_dt(struct cpsw_platform_data *data, } snprintf(slave_data->phy_id, sizeof(slave_data->phy_id), PHY_ID_FMT, mdio->name, phyid); - slave_data->phy_if = of_get_phy_mode(slave_node); if (slave_data->phy_if < 0) { dev_err(&pdev->dev, "Missing or malformed slave[%d] phy-mode property\n", @@ -2240,7 +2247,7 @@ static int cpsw_probe(struct platform_device *pdev) /* Select default pin state */ pinctrl_pm_select_default_state(&pdev->dev); - if (cpsw_probe_dt(&priv->data, pdev)) { + if (cpsw_probe_dt(priv, pdev)) { dev_err(&pdev->dev, "cpsw: platform data missing\n"); ret = -ENODEV; goto clean_runtime_disable_ret; -- 2.1.0 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 2/2] net: phy: smsc: disable energy detect mode
On some boards the energy enable detect mode leads in trouble with some switches, so make the enabling of this mode configurable through DT. Signed-off-by: Heiko Schocher --- Changes in v2: - add comments from Florian Fainelli - I did not change disable property name into enable because I fear to break existing behaviour - add smsc vendor prefix - remove CONFIG_OF and use __maybe_unused - introduce "phy-handle" ability into ti,cpsw driver, so I can remove bogus: if (!of_node && dev->parent->of_node) of_node = dev->parent->of_node; construct. Therefore new patch for the ti,cpsw driver is necessary. .../devicetree/bindings/net/smsc-lan87xx.txt | 24 ++ drivers/net/phy/smsc.c | 19 - 2 files changed, 38 insertions(+), 5 deletions(-) create mode 100644 Documentation/devicetree/bindings/net/smsc-lan87xx.txt diff --git a/Documentation/devicetree/bindings/net/smsc-lan87xx.txt b/Documentation/devicetree/bindings/net/smsc-lan87xx.txt new file mode 100644 index 000..974edd5 --- /dev/null +++ b/Documentation/devicetree/bindings/net/smsc-lan87xx.txt @@ -0,0 +1,24 @@ +SMSC LAN87xx Ethernet PHY + +Some boards require special tuning values. Configure them +through an Ethernet OF device node. + +Optional properties: + +- smsc,disable-energy-detect: + If set, do not enable energy detect mode for the SMSC phy. + default: enable energy detect mode + +Examples: +smsc phy with disabled energy detect mode on an am335x based board. +&davinci_mdio { + pinctrl-names = "default", "sleep"; + pinctrl-0 = <&davinci_mdio_default>; + pinctrl-1 = <&davinci_mdio_sleep>; + status = "okay"; + + ethernetphy0: ethernet-phy@0 { + reg = <0>; + smsc,disable-energy-detect; + }; +}; diff --git a/drivers/net/phy/smsc.c b/drivers/net/phy/smsc.c index 70b0895..dc2da87 100644 --- a/drivers/net/phy/smsc.c +++ b/drivers/net/phy/smsc.c @@ -43,16 +43,25 @@ static int smsc_phy_ack_interrupt(struct phy_device *phydev) static int smsc_phy_config_init(struct phy_device *phydev) { + int __maybe_unused len; + struct device *dev __maybe_unused = &phydev->dev; + struct device_node *of_node __maybe_unused = dev->of_node; int rc = phy_read(phydev, MII_LAN83C185_CTRL_STATUS); + int enable_energy = 1; if (rc < 0) return rc; - /* Enable energy detect mode for this SMSC Transceivers */ - rc = phy_write(phydev, MII_LAN83C185_CTRL_STATUS, - rc | MII_LAN83C185_EDPWRDOWN); - if (rc < 0) - return rc; + if (of_find_property(of_node, "smsc,disable-energy-detect", &len)) + enable_energy = 0; + + if (enable_energy) { + /* Enable energy detect mode for this SMSC Transceivers */ + rc = phy_write(phydev, MII_LAN83C185_CTRL_STATUS, + rc | MII_LAN83C185_EDPWRDOWN); + if (rc < 0) + return rc; + } return smsc_phy_ack_interrupt(phydev); } -- 2.1.0 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 0/2] net, phy, smsc: add posibility to disable energy detect mode
On some boards the energy enable detect mode leads in trouble with some switches, so make the enabling of this mode configurable through DT. Therefore the property "smsc,disable-energy-detect" is introduced. Patch 1 introduces phy-handle support for the ti,cpsw driver. This is needed now for the smsc phy. Patch 2 adds the disable energy mode functionality to the smsc phy Changes in v2: - add comments from Florian Fainelli - I did not change disable property name into enable because I fear to break existing behaviour - add smsc vendor prefix - remove CONFIG_OF and use __maybe_unused - introduce "phy-handle" ability into ti,cpsw driver, so I can remove bogus: if (!of_node && dev->parent->of_node) of_node = dev->parent->of_node; construct. Therefore new patch for the ti,cpsw driver is necessary. Heiko Schocher (2): drivers: net: cpsw: add phy-handle parsing net: phy: smsc: disable energy detect mode Documentation/devicetree/bindings/net/cpsw.txt | 1 + .../devicetree/bindings/net/smsc-lan87xx.txt | 24 ++ drivers/net/ethernet/ti/cpsw.c | 15 ++ drivers/net/phy/smsc.c | 19 - 4 files changed, 50 insertions(+), 9 deletions(-) create mode 100644 Documentation/devicetree/bindings/net/smsc-lan87xx.txt -- 2.1.0 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] net: phy: smsc: disable energy detect mode
Hello Florian, Am 16.10.2015 um 18:27 schrieb Florian Fainelli: 2015-10-13 21:17 GMT-07:00 Heiko Schocher : Hello Florian, Am 13.10.2015 um 21:26 schrieb Florian Fainelli: On 12/10/15 22:13, Heiko Schocher wrote: On some boards the energy enable detect mode leads in trouble with some switches, so make the enabling of this mode configurable through DT. Signed-off-by: Heiko Schocher --- .../devicetree/bindings/net/smsc-lan87xx.txt | 19 + drivers/net/phy/smsc.c | 24 +- 2 files changed, 38 insertions(+), 5 deletions(-) create mode 100644 Documentation/devicetree/bindings/net/smsc-lan87xx.txt diff --git a/Documentation/devicetree/bindings/net/smsc-lan87xx.txt b/Documentation/devicetree/bindings/net/smsc-lan87xx.txt new file mode 100644 index 000..39aa1dc --- /dev/null +++ b/Documentation/devicetree/bindings/net/smsc-lan87xx.txt @@ -0,0 +1,19 @@ +SMSC LAN87xx Ethernet PHY + +Some boards require special tuning values. Configure them +through an Ethernet OF device node. + +Optional properties: + +- disable-energy-detect: + If set, do not enable energy detect mode for the SMSC phy. + default: enable energy detect mode Although energy detection is something that is implemented by many PHYs, I am not sure a generic property is suitable here, I would prefix that with the SMSC vendor prefix here to make it clear this only applies to this PHY. Hmm... but all PHYs should be able to enable, disable it in some way, or? It may not always be controlled directly at the PHY level, sometimes this is something that needs cooperation with the Ethernet MAC as well in case of integrated designs. Ah, ok! Would not you want to make it a reverse property here though, something like this: smsc,energy-detect: boolean, when present indicates the PHY reliably supports energy detection Yes, that was also my first thought, but currently, on this PHYs energy detect mode is on ... and if I introduce such a property, it will disable it for all existing boards, because property is missing ... so, maybe I break boards ... Fair enough, how about smsc,disabled-energy-detect or something like that then? Yes, changed it to "smsc,disable-energy-detect" bye, Heiko -- DENX Software Engineering GmbH, Managing Director: Wolfgang Denk HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
On 10/16/15 4:44 PM, Eric W. Biederman wrote: Alexei Starovoitov writes: We can argue about api for 2nd, whether it's mount with fd=1234 string or else, but for the first mount style doesn't make sense. Why does mount not make sense? It is exactly what you are looking for so why does it not make sense? hmm, how do you get a new fd back after mounting it? Note, open cannot be overloaded, so we end up with BPF_NEW_FD anyway, but now it's more convoluted and empty mounts are everywhere. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 net-next] RDS: fix rds-ping deadlock over TCP transport
Sowmini found hang with rds-ping while testing RDS over TCP. Its a corner case and doesn't happen always. The issue is not reproducible with IB transport. Its clear from below dump why we see it with RDS TCP. [] do_tcp_setsockopt+0xb5/0x740 [] tcp_setsockopt+0x24/0x30 [] sock_common_setsockopt+0x14/0x20 [] rds_tcp_xmit_prepare+0x5d/0x70 [rds_tcp] [] rds_send_xmit+0xd7/0x740 [rds] [] rds_send_pong+0x142/0x180 [rds] [] rds_recv_incoming+0x274/0x330 [rds] [] ? ttwu_queue+0x11e/0x130 [] ? skb_copy_bits+0x6d/0x2c0 [] rds_tcp_data_recv+0x2f0/0x3d0 [rds_tcp] [] tcp_read_sock+0x96/0x1c0 [] ? rds_tcp_recv_init+0x40/0x40 [rds_tcp] [] ? sock_def_write_space+0xa0/0xa0 [] rds_tcp_data_ready+0xa1/0xf0 [rds_tcp] [] tcp_data_queue+0x379/0x5b0 [] ? rds_tcp_write_space+0xbb/0x110 [rds_tcp] [] tcp_rcv_established+0x2e2/0x6e0 [] tcp_v4_do_rcv+0x122/0x220 [] tcp_v4_rcv+0x867/0x880 [] ip_local_deliver_finish+0xa3/0x220 This happens because rds_send_xmit() chain wants to take sock_lock which is already taken by tcp_v4_rcv() on its way to rds_tcp_data_ready(). Commit db6526dcb51b ("RDS: use rds_send_xmit() state instead of RDS_LL_SEND_FULL") which was trying to opportunistically finish the send request in same thread context. But because of above recursive lock hang with RDS TCP, the send work from rds_send_pong() needs to deferred to worker to avoid lock up. Given RDS ping is more of connectivity test than performance critical path, its should be ok even for transport like IB. Reported-by: Sowmini Varadhan Acked-by: Sowmini Varadhan Signed-off-by: Santosh Shilimkar Signed-off-by: Santosh Shilimkar --- v2: Dropped the confusing SEND_LL_FULL check from v1 net/rds/send.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/net/rds/send.c b/net/rds/send.c index ee49c25..827155c 100644 --- a/net/rds/send.c +++ b/net/rds/send.c @@ -1182,9 +1182,8 @@ rds_send_pong(struct rds_connection *conn, __be16 dport) rds_stats_inc(s_send_queued); rds_stats_inc(s_send_pong); - ret = rds_send_xmit(conn); - if (ret == -ENOMEM || ret == -EAGAIN) - queue_delayed_work(rds_wq, &conn->c_send_w, 1); + /* schedule the send work on rds_wq */ + queue_delayed_work(rds_wq, &conn->c_send_w, 1); rds_message_put(rm); return 0; -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch net-next v5 3/8] switchdev: allow caller to explicitly request attr_set as deferred
On 15-10-16 01:23 AM, Jiri Pirko wrote: > Thu, Oct 15, 2015 at 05:21:22PM CEST, john.fastab...@gmail.com wrote: >> On 15-10-14 10:40 AM, Jiri Pirko wrote: >>> From: Jiri Pirko >>> >>> Caller should know if he can call attr_set directly (when holding RTNL) >>> or if he has to defer the att_set processing for later. >>> >>> This also allows drivers to sleep inside attr_set and report operation >>> status back to switchdev core. Switchdev core then warns if status is >>> not ok, instead of silent errors happening in drivers. >>> >>> Benefit from newly introduced switchdev deferred ops infrastructure. >>> >> >> A nit but the patch description should note your setting the defer bit >> on the bridge set state. >> >>> Signed-off-by: Jiri Pirko >>> --- >>> include/net/switchdev.h | 1 + >>> net/bridge/br_stp.c | 3 +- >>> net/switchdev/switchdev.c | 108 >>> ++ >>> 3 files changed, 46 insertions(+), 66 deletions(-) >>> >>> diff --git a/include/net/switchdev.h b/include/net/switchdev.h >>> index d1c7f90..f7de6f8 100644 >>> --- a/include/net/switchdev.h >>> +++ b/include/net/switchdev.h >>> @@ -17,6 +17,7 @@ >>> >>> #define SWITCHDEV_F_NO_RECURSE BIT(0) >>> #define SWITCHDEV_F_SKIP_EOPNOTSUPPBIT(1) >>> +#define SWITCHDEV_F_DEFER BIT(2) >>> >>> struct switchdev_trans_item { >>> struct list_head list; >>> diff --git a/net/bridge/br_stp.c b/net/bridge/br_stp.c >>> index db6d243de..80c34d7 100644 >>> --- a/net/bridge/br_stp.c >>> +++ b/net/bridge/br_stp.c >>> @@ -41,13 +41,14 @@ void br_set_state(struct net_bridge_port *p, unsigned >>> int state) >>> { >>> struct switchdev_attr attr = { >>> .id = SWITCHDEV_ATTR_ID_PORT_STP_STATE, >>> + .flags = SWITCHDEV_F_DEFER, >>> .u.stp_state = state, >>> }; >> >> >> This creates a possible race (with 6/8) I think, please check! > > Wait. This patch does not change the previous behaviour. Patch 6 does, > so I don't understand why you are asking here. Confusing. > Sorry if its confusing I keyed of the addition of the SWITCHDEV_F_DEFER here. > >> >> In del_nbp() we call br_stp_disable_port() to set the port state >> to BR_STATE_DISABLE and disabling learning events. But with this >> patch it can be deferred. Also note the STP agent may be in userspace >> which actually seems more likely the case because you likely want to >> run some more modern variant of STP than the kernel supports. >> >> So at some point in the future the driver will turn off learning. At >> the same time we call br_fdb_delete_by_port which calls a deferred >> set of fdb deletes. >> >> I don't see how you guarantee learning is off before you start doing >> the deletes here and possibly learning new addresses after the software >> side believes the port is down. >> >> So >> >> br_stp_disable_port >> br_fdb_delete_by_port >> {fdb_del_external_learn} >> [hw learns a fdb] >> [hw disables learning] >> >> What stops this from happening? > > Okay. This behaviour is the same as without the patchset. What would > resolve the issue it to put switchdev_deferred_process() after > br_stp_disable_port() and before br_fdb_delete_by_port() call. > That would enforce stp change to happen in hw before fdbs are explicitly > deleted. Sound good to you? OK so putting the switchdev_deferred_process() between the disable_port and the delete_by_port will enforce the stp change to happen in hw before the fdbs are explicitly deleted. I think this is minimally required. I don't like scattering these flush_workqueue() calls all over the place but I don't have any better ideas right now so sounds good enough. But now I'm wondering if you can have a deferred fdb add in the rocker driver (rocker_port_fdb_learn_work) running in parallel with this that could happen after the delete and add a bogus fdb entry. I think you also need to have a flush in rocker_port_stp_update() to handle this case. Also I agree these issues were not completely caused by your patches. Thanks, John -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-next] RDS: fix rds-ping deadlock over TCP transport
On 10/16/2015 6:45 PM, Sowmini Varadhan wrote: On (10/16/15 20:26), Santosh Shilimkar wrote: diff --git a/net/rds/send.c b/net/rds/send.c + if (!test_bit(RDS_LL_SEND_FULL, &conn->c_flags)) + queue_delayed_work(rds_wq, &conn->c_send_w, 0); A minor note- it would help to add some comments here explaining that the pong has already been added to the sendq earlier.. in the case of IB, if RDS_LL_SEND_FULL has been set, it takes some head-scratching to figure out how the pong gets sent, and a few comments could help clarify that. Right. The check confused you. I will update the patch and drop the check all together and add a jiffies for the queuing which should take care of it. Will post v2 with that update. but other than that, the contents look good to me, thus Acked-by: Sowmini Varadhan Thanks !! regards, Santosh -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-next] RDS: fix rds-ping deadlock over TCP transport
On (10/16/15 20:26), Santosh Shilimkar wrote: > > diff --git a/net/rds/send.c b/net/rds/send.c > + if (!test_bit(RDS_LL_SEND_FULL, &conn->c_flags)) > + queue_delayed_work(rds_wq, &conn->c_send_w, 0); A minor note- it would help to add some comments here explaining that the pong has already been added to the sendq earlier.. in the case of IB, if RDS_LL_SEND_FULL has been set, it takes some head-scratching to figure out how the pong gets sent, and a few comments could help clarify that. but other than that, the contents look good to me, thus Acked-by: Sowmini Varadhan -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net-next] RDS: fix rds-ping deadlock over TCP transport
Sowmini found hang with rds-ping while testing RDS over TCP. Its a corner case and doesn't happen always. The issue is not reproducible with IB transport. Its clear from below dump why we see it with RDS TCP. [] do_tcp_setsockopt+0xb5/0x740 [] tcp_setsockopt+0x24/0x30 [] sock_common_setsockopt+0x14/0x20 [] rds_tcp_xmit_prepare+0x5d/0x70 [rds_tcp] [] rds_send_xmit+0xd7/0x740 [rds] [] rds_send_pong+0x142/0x180 [rds] [] rds_recv_incoming+0x274/0x330 [rds] [] ? ttwu_queue+0x11e/0x130 [] ? skb_copy_bits+0x6d/0x2c0 [] rds_tcp_data_recv+0x2f0/0x3d0 [rds_tcp] [] tcp_read_sock+0x96/0x1c0 [] ? rds_tcp_recv_init+0x40/0x40 [rds_tcp] [] ? sock_def_write_space+0xa0/0xa0 [] rds_tcp_data_ready+0xa1/0xf0 [rds_tcp] [] tcp_data_queue+0x379/0x5b0 [] ? rds_tcp_write_space+0xbb/0x110 [rds_tcp] [] tcp_rcv_established+0x2e2/0x6e0 [] tcp_v4_do_rcv+0x122/0x220 [] tcp_v4_rcv+0x867/0x880 [] ip_local_deliver_finish+0xa3/0x220 This happens because rds_send_xmit() chain wants to take sock_lock which is already taken by tcp_v4_rcv() on its way to rds_tcp_data_ready(). Commit db6526dcb51b ("RDS: use rds_send_xmit() state instead of RDS_LL_SEND_FULL") which was trying to opportunistically finish the send request in same thread context. But because of above recursive lock hang with RDS TCP, the send work from rds_send_pong() needs to deferred to worker to avoid lock up. Given RDS ping is more of connectivity test than performance critical path, its should be ok even for transport like IB. Reported-by: Sowmini Varadhan Signed-off-by: Santosh Shilimkar Signed-off-by: Santosh Shilimkar --- net/rds/send.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/net/rds/send.c b/net/rds/send.c index ee49c25..7a377a1 100644 --- a/net/rds/send.c +++ b/net/rds/send.c @@ -1182,9 +1182,8 @@ rds_send_pong(struct rds_connection *conn, __be16 dport) rds_stats_inc(s_send_queued); rds_stats_inc(s_send_pong); - ret = rds_send_xmit(conn); - if (ret == -ENOMEM || ret == -EAGAIN) - queue_delayed_work(rds_wq, &conn->c_send_w, 1); + if (!test_bit(RDS_LL_SEND_FULL, &conn->c_flags)) + queue_delayed_work(rds_wq, &conn->c_send_w, 0); rds_message_put(rm); return 0; -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] net/phy: micrel: Follow attached_dev to get to the MAC device
On Sat, Oct 17, 2015 at 02:32:30AM +0300, Sergei Shtylyov wrote: > On 10/16/2015 11:49 PM, Andrew Lunn wrote: > > >Commit "8b63ec1837fa phylib: Make PHYs children of their MDIO bus, not > >the bus' parent." broke finding PHY properties in the MAC device tree > >You probably forgot to run checkpatch.pl on this patch, else it > would have complained about the commit citing style. It's <12-bit > SHA1> (""). Actual, i did, and decided to ignore it. I'm quoting the regression report, which formats it that way. However i did deliberately use the correct format for the fixes: line, where it actually matters. checkpatch is just a guide, not a rigid rule. Andrew -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
Alexei Starovoitov writes: > We can argue about api for 2nd, whether it's mount with fd=1234 string > or else, but for the first mount style doesn't make sense. Why does mount not make sense? It is exactly what you are looking for so why does it not make sense? Eric -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] net/phy: micrel: Follow attached_dev to get to the MAC device
On Fri, Oct 16, 2015 at 05:58:41PM -0500, Dinh Nguyen wrote: > On Fri, 16 Oct 2015, Andrew Lunn wrote: > > > Commit "8b63ec1837fa phylib: Make PHYs children of their MDIO bus, not > > the bus' parent." broke finding PHY properties in the MAC device tree > > node. The parent device is now the MDIO bus, not the MAC. Use > > attached_dev towards the MAC device tree node. > > > > Fixes: 8b63ec1837fa ("phylib: Make PHYs children of their MDIO bus, not the > > bus' parent.") > > Signed-off-by: Andrew Lunn > > --- > > > > Compile tested only. > > > > Dinh, please could you test it and report back if it works or not. > > > > This patch did not seem to fix the problem. The following code did seem to > fix the problem: > > if (!of_node && dev->parent->of_node) > - of_node = dev->parent->of_node; > + do { > + of_node = dev->of_node; > + dev = dev->parent; > + i++; > + } while (!of_node && dev); This might fix the issue, but it has disadvantages. As i said before, it allows people to place phy properties into the mdio device node. We want to be reducing placing you can add phy properties, not adding more. Please could you try to debug why my patch did not work. Is attached_dev null? Andrew -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net] tunnels: Don't require remote endpoint or ID during creation.
Before lightweight tunnels existed, it really didn't make sense to create a tunnel that was not fully specified, such as without a destination IP address - the resulting packets would go nowhere. However, with lightweight tunnels, the opposite is true - it doesn't make sense to require this information when it will be provided later on by the route. This loosens the requirements for this information. An alternative would be to allow the relaxed version only when COLLECT_METADATA is enabled. However, since there are several variations on this theme (such as NBMA tunnels in GRE), just dropping the restrictions seems the most consistent across tunnels and with the existing configuration. CC: John Linville Signed-off-by: Jesse Gross --- drivers/net/geneve.c | 12 ++-- drivers/net/vxlan.c | 7 +++ 2 files changed, 9 insertions(+), 10 deletions(-) diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c index 8f5c02e..cde29f8 100644 --- a/drivers/net/geneve.c +++ b/drivers/net/geneve.c @@ -870,14 +870,14 @@ static int geneve_newlink(struct net *net, struct net_device *dev, __be16 dst_port = htons(GENEVE_UDP_PORT); __u8 ttl = 0, tos = 0; bool metadata = false; - __be32 rem_addr; - __u32 vni; + __be32 rem_addr = 0; + __u32 vni = 0; - if (!data[IFLA_GENEVE_ID] || !data[IFLA_GENEVE_REMOTE]) - return -EINVAL; + if (data[IFLA_GENEVE_ID]) + vni = nla_get_u32(data[IFLA_GENEVE_ID]); - vni = nla_get_u32(data[IFLA_GENEVE_ID]); - rem_addr = nla_get_in_addr(data[IFLA_GENEVE_REMOTE]); + if (data[IFLA_GENEVE_REMOTE]) + rem_addr = nla_get_in_addr(data[IFLA_GENEVE_REMOTE]); if (data[IFLA_GENEVE_TTL]) ttl = nla_get_u8(data[IFLA_GENEVE_TTL]); diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c index bbac1d3..afdc65f 100644 --- a/drivers/net/vxlan.c +++ b/drivers/net/vxlan.c @@ -2745,11 +2745,10 @@ static int vxlan_newlink(struct net *src_net, struct net_device *dev, struct vxlan_config conf; int err; - if (!data[IFLA_VXLAN_ID]) - return -EINVAL; - memset(&conf, 0, sizeof(conf)); - conf.vni = nla_get_u32(data[IFLA_VXLAN_ID]); + + if (data[IFLA_VXLAN_ID]) + conf.vni = nla_get_u32(data[IFLA_VXLAN_ID]); if (data[IFLA_VXLAN_GROUP]) { conf.remote_ip.sin.sin_addr.s_addr = nla_get_in_addr(data[IFLA_VXLAN_GROUP]); -- 2.1.4 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] net/phy: micrel: Follow attached_dev to get to the MAC device
On 10/16/2015 11:49 PM, Andrew Lunn wrote: Commit "8b63ec1837fa phylib: Make PHYs children of their MDIO bus, not the bus' parent." broke finding PHY properties in the MAC device tree You probably forgot to run checkpatch.pl on this patch, else it would have complained about the commit citing style. It's <12-bit SHA1> (""). node. The parent device is now the MDIO bus, not the MAC. Use attached_dev towards the MAC device tree node. Fixes: 8b63ec1837fa ("phylib: Make PHYs children of their MDIO bus, not the bus' parent.") Yeah, exactly like this. Signed-off-by: Andrew Lunn [...] MBR, Sergei -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/3] ip-route man: add usage and description for lwtunnel encap attributes
On Thu, 15 Oct 2015 13:13:40 +0200 Thomas Graf wrote: > From: Roopa Prabhu > > This patch updates ip-route man page with lwtunnel encap > usage and description, covering MPLS and IP encapsulation. > > Signed-off-by: Roopa Prabhu > Signed-off-by: Thomas Graf I applied all 3 patches. The first one into master (since it might be useful for other code), and other two to net-next. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net-next] uapi: add mpls_iptunnel.h
Add missing rule to export mpls iptunnel header needed by iproute2 Signed-off-by: Stephen Hemminger --- a/include/uapi/linux/Kbuild 2015-10-12 08:26:44.745317580 -0700 +++ b/include/uapi/linux/Kbuild 2015-10-16 16:09:39.362933576 -0700 @@ -263,6 +263,7 @@ header-y += minix_fs.h header-y += mman.h header-y += mmtimer.h header-y += mpls.h +header-y += mpls_iptunnel.h header-y += mqueue.h header-y += mroute6.h header-y += mroute.h -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] net/phy: micrel: Follow attached_dev to get to the MAC device
On Fri, 16 Oct 2015, Andrew Lunn wrote: > Commit "8b63ec1837fa phylib: Make PHYs children of their MDIO bus, not > the bus' parent." broke finding PHY properties in the MAC device tree > node. The parent device is now the MDIO bus, not the MAC. Use > attached_dev towards the MAC device tree node. > > Fixes: 8b63ec1837fa ("phylib: Make PHYs children of their MDIO bus, not the > bus' parent.") > Signed-off-by: Andrew Lunn > --- > > Compile tested only. > > Dinh, please could you test it and report back if it works or not. > This patch did not seem to fix the problem. The following code did seem to fix the problem: if (!of_node && dev->parent->of_node) - of_node = dev->parent->of_node; + do { + of_node = dev->of_node; + dev = dev->parent; + i++; + } while (!of_node && dev); BR, Dinh -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 iproute2] ip-address: fix oneline mode for interfaces with VF
On Fri, 16 Oct 2015 12:38:33 +0200 Phil Sutter wrote: > Signed-off-by: Phil Sutter > --- > Changed since v1: Fix commit message - typo in From: header confused > git-send-email. Ok, applied -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH iproute2] ip monitor neigh: Change 'delete' to 'Deleted' to be consistent with ip route
On Thu, 15 Oct 2015 11:47:43 -0700 Roopa Prabhu wrote: > From: Roopa Prabhu > > It helps to grep for one string "Deleted" when monitoring all events. > > Fixes: 6ea3ebafe077 ("iproute2: inform user when a neighbor is removed") > Signed-off-by: Roopa Prabhu > --- > I am not sure if it is too late for this change. But, sending this patch > out because it only affects ip monitor output > Sure applied. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH V3 2/2] bpf: control all the perf events stored in PERF_EVENT_ARRAY maps
On 10/16/15 12:42 AM, Kaixu Xia wrote: This patch implements the function that controlling all the perf events stored in PERF_EVENT_ARRAY maps by setting the parameter 'index' to maps max_entries. Signed-off-by: Kaixu Xia --- kernel/trace/bpf_trace.c | 20 ++-- 1 file changed, 18 insertions(+), 2 deletions(-) diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index 3175600..4b385863 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -229,13 +229,30 @@ static u64 bpf_perf_event_dump_control(u64 r1, u64 index, u64 flag, u64 r4, u64 struct bpf_map *map = (struct bpf_map *) (unsigned long) r1; struct bpf_array *array = container_of(map, struct bpf_array, map); struct perf_event *event; + int i; - if (unlikely(index >= array->map.max_entries)) + if (unlikely(index > array->map.max_entries)) return -E2BIG; if (flag & BIT_FLAG_CHECK) return -EINVAL; + if (index == array->map.max_entries) { I don't like in-band signaling like this, since it's easy to make mistake on the bpf program side. Please use 2nd bit of 'flags' for that instead. Also squash this patch into 1st. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH V3 1/2] bpf: control the trace data output on current cpu when perf sampling
On 10/16/15 12:42 AM, Kaixu Xia wrote: This patch adds the flag dump_enable to control the trace data output process when perf sampling. By setting this flag and integrating with ebpf, we can control the data output process and get the samples we are most interested in. The bpf helper bpf_perf_event_dump_control() can control the perf_event on current cpu. Signed-off-by: Kaixu Xia --- include/linux/perf_event.h | 1 + include/uapi/linux/bpf.h| 5 + include/uapi/linux/perf_event.h | 3 ++- kernel/bpf/verifier.c | 3 ++- kernel/events/core.c| 13 kernel/trace/bpf_trace.c| 44 + 6 files changed, 67 insertions(+), 2 deletions(-) diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index 092a0e8..2af527e 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -472,6 +472,7 @@ struct perf_event { struct irq_work pending; atomic_tevent_limit; + atomic_tdump_enable; The naming is the hardest... How about calling it 'soft_enable' instead? --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -287,6 +287,11 @@ enum bpf_func_id { * Return: realm if != 0 */ BPF_FUNC_get_route_realm, + + /** +* u64 bpf_perf_event_dump_control(&map, index, flag) +*/ + BPF_FUNC_perf_event_dump_control, and this one is too long. May be bpf_perf_event_control() ? Daniel, any thoughts on naming? --- a/include/uapi/linux/perf_event.h +++ b/include/uapi/linux/perf_event.h @@ -331,7 +331,8 @@ struct perf_event_attr { comm_exec : 1, /* flag comm events that are due to an exec */ use_clockid: 1, /* use @clockid for time fields */ context_switch : 1, /* context switch data */ - __reserved_1 : 37; + dump_enable: 1, /* don't output data on samples */ either comment or name is wrong. how about calling this one 'soft_disable', since you want zero to be default and the event should be on. diff --git a/kernel/events/core.c b/kernel/events/core.c index b11756f..74a16af 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -6337,6 +6337,9 @@ static int __perf_event_overflow(struct perf_event *event, irq_work_queue(&event->pending); } + if (!atomic_read(&event->dump_enable)) + return ret; I'm not an expert in this piece of perf, but should it be 'return 0' instead ? and may be moved to is_sampling_event() check? Also please add unlikely(). +static void perf_event_check_dump_flag(struct perf_event *event) +{ + if (event->attr.dump_enable == 1) + atomic_set(&event->dump_enable, 1); + else + atomic_set(&event->dump_enable, 0); that looks like it breaks perf, since default for bits is zero and all events will be soft-disabled? How did you test it? Please add a test to samples/bpf/ for this feature. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] net: hix5hd2_gmac: avoid integer overload warning
On Friday 16 October 2015 14:22:15 Joe Perches wrote: > On Fri, 2015-10-16 at 21:50 +0300, Sergei Shtylyov wrote: > > On 10/16/2015 09:04 PM, Joe Perches wrote: > > > > BITS_RX_EN is an 'unsigned long' constant, so the ones complement of > > that > > has bits set that do not fit into a 32-bit variable on 64-bit > > architectures, > > which causes a harmless gcc warning: > > >>> ... > > static void hix5hd2_port_disable(struct hix5hd2_priv *priv) > > { > > - writel_relaxed(~(BITS_RX_EN | BITS_TX_EN), priv->base + PORT_EN); > > + writel_relaxed(~(u32)(BITS_RX_EN | BITS_TX_EN), priv->base + > > PORT_EN); > > writel_relaxed(0, priv->base + DESC_WR_RD_ENA); > > >>> > > >>> ISTM that just means that the constants shouldn't be 'long'. > > >> > > >> Right, but that would probably mean changing the BIT() macro or not > > >> using it > > >> here. In the past I've argued against using that macro, but I've given > > >> up that fight. > > > > > > Fight on... (Somebody must have gone to USC here) Ok, I'll try: Please stop this nonsense! ;-) > > > There might be value in aefin BIT_U32 macro. > > > Maybe BIT_U64 too. > > > > There's BIT_ULL() already. > > I know, but symmetry is good. > I think there'd be no harm in adding it. > Perhaps adding all the sized variants would be useful. > > Something like: > > #define BIT_OF_TYPE(type, nr) \ > ({\ > typeof(type) rtn; \ > BUILD_BUG_ON(__builtin_constant_p(nr) &&\ >((nr) < 0 || \ > (nr) >= sizeof(type) * BITS_PER_BYTE)); \ > rtn = ((type)1) << (nr);\ > rtn;\ > }) > > #define BIT_U8(nr)BIT_OF_TYPE(u8, nr) > #define BIT_U16(nr) BIT_OF_TYPE(u16, nr) > #define BIT_U32(nr) BIT_OF_TYPE(u32, nr) > #define BIT_U64(nr) BIT_OF_TYPE(u64, nr) As I said, I'd rather see less uses of BIT() instead of more. While using 'BIT(23)' is often than the open-coded '1 << 23', I wish more people would write that as '0x0080' instead. It's easier to match with data sheets, and to compare to printk output, plus it's non-ambiguous if you are dealing with data sheets that use the IBM convention of counting the bits from the other end. Arnd -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-next 1/2] drivers: net: xgene: Add support RGMII TX/RX delay configuration
On 16/10/15 13:35, Iyappan Subramanian wrote: > Add RGMII TX/RX delay configuration support. RGMII standard requires 2ns > delay to help the RGMII bridge receiver to sample data correctly. If the > default value does not provide proper centering of the data sample, the > TX/RX delay parameters can be used to adjust accordingly. There is a standard 'phy-mode' property which can take multiple values for RGMII: "rgmii" (no delay), "rgmii-txid" (transmit delay), "rgmii-rxid" (receive delay) and "rgmii-id" (symetric delay). There does not seem to be any verification of whether the rx or tx delay parameters you introduce are going to be either sensible, or not conflicting with the 'phy-mode' that should be configured. > > Signed-off-by: Iyappan Subramanian > --- > drivers/net/ethernet/apm/xgene/xgene_enet_hw.c | 8 +++- > drivers/net/ethernet/apm/xgene/xgene_enet_hw.h | 1 + > drivers/net/ethernet/apm/xgene/xgene_enet_main.c | 49 > > drivers/net/ethernet/apm/xgene/xgene_enet_main.h | 2 + > 4 files changed, 59 insertions(+), 1 deletion(-) > > diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c > b/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c > index 652f218..33850a0 100644 > --- a/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c > +++ b/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c > @@ -461,6 +461,7 @@ static void xgene_gmac_reset(struct xgene_enet_pdata > *pdata) > > static void xgene_gmac_init(struct xgene_enet_pdata *pdata) > { > + struct device *dev = &pdata->pdev->dev; > u32 value, mc2; > u32 intf_ctl, rgmii; > u32 icm0, icm2; > @@ -490,7 +491,12 @@ static void xgene_gmac_init(struct xgene_enet_pdata > *pdata) > default: > ENET_INTERFACE_MODE2_SET(&mc2, 2); > intf_ctl |= ENET_GHD_MODE; > - CFG_TXCLK_MUXSEL0_SET(&rgmii, 4); > + > + if (dev->of_node) { > + CFG_TXCLK_MUXSEL0_SET(&rgmii, pdata->tx_delay); > + CFG_RXCLK_MUXSEL0_SET(&rgmii, pdata->rx_delay); > + } > + > xgene_enet_rd_csr(pdata, DEBUG_REG_ADDR, &value); > value |= CFG_BYPASS_UNISEC_TX | CFG_BYPASS_UNISEC_RX; > xgene_enet_wr_csr(pdata, DEBUG_REG_ADDR, value); > diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_hw.h > b/drivers/net/ethernet/apm/xgene/xgene_enet_hw.h > index ff05bbc..6dee73c 100644 > --- a/drivers/net/ethernet/apm/xgene/xgene_enet_hw.h > +++ b/drivers/net/ethernet/apm/xgene/xgene_enet_hw.h > @@ -144,6 +144,7 @@ enum xgene_enet_rm { > #define CFG_BYPASS_UNISEC_RX BIT(1) > #define CFG_CLE_BYPASS_EN0 BIT(31) > #define CFG_TXCLK_MUXSEL0_SET(dst, val) xgene_set_bits(dst, val, 29, 3) > +#define CFG_RXCLK_MUXSEL0_SET(dst, val) xgene_set_bits(dst, val, 26, 3) > > #define CFG_CLE_IP_PROTOCOL0_SET(dst, val) xgene_set_bits(dst, val, 16, 2) > #define CFG_CLE_DSTQID0_SET(dst, val)xgene_set_bits(dst, > val, 0, 12) > diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_main.c > b/drivers/net/ethernet/apm/xgene/xgene_enet_main.c > index 6b1846d..ce10687 100644 > --- a/drivers/net/ethernet/apm/xgene/xgene_enet_main.c > +++ b/drivers/net/ethernet/apm/xgene/xgene_enet_main.c > @@ -1118,6 +1118,47 @@ static int xgene_get_port_id_dt(struct device *dev, > struct xgene_enet_pdata *pda > return ret; > } > > +static int xgene_get_tx_delay(struct xgene_enet_pdata *pdata) > +{ > + struct device *dev = &pdata->pdev->dev; > + int delay, ret; > + > + ret = of_property_read_u32(dev->of_node, "tx-delay", &delay); > + if (ret) { > + pdata->tx_delay = 4; > + return 0; > + } > + > + if (delay < 0 || delay > 7) { > + dev_err(dev, "Invalid tx-delay specified\n"); > + return -EINVAL; > + } > + > + pdata->tx_delay = delay; > + > + return 0; > +} > + > +static int xgene_get_rx_delay(struct xgene_enet_pdata *pdata) > +{ > + struct device *dev = &pdata->pdev->dev; > + int delay, ret; > + > + ret = of_property_read_u32(dev->of_node, "rx-delay", &delay); > + if (ret) { > + pdata->rx_delay = 2; > + return 0; > + } > + > + if (delay < 0 || delay > 7) { > + dev_err(dev, "Invalid rx-delay specified\n"); > + return -EINVAL; > + } > + > + pdata->rx_delay = delay; > + > + return 0; > +} > > static int xgene_enet_get_resources(struct xgene_enet_pdata *pdata) > { > @@ -1194,6 +1235,14 @@ static int xgene_enet_get_resources(struct > xgene_enet_pdata *pdata) > return -ENODEV; > } > > + ret = xgene_get_tx_delay(pdata); > + if (ret) > + return ret; > + > + ret = xgene_get_rx_delay(pdata); > + if (ret) > + return ret; > + > ret = platform_get_irq(pdev, 0); > if (ret <= 0) { > dev_err(dev, "Unable to get ENET Rx IRQ\n"); > diff --git a/d
Re: [PATCH] net: hix5hd2_gmac: avoid integer overload warning
On Fri, 2015-10-16 at 21:50 +0300, Sergei Shtylyov wrote: > On 10/16/2015 09:04 PM, Joe Perches wrote: > > BITS_RX_EN is an 'unsigned long' constant, so the ones complement of that > has bits set that do not fit into a 32-bit variable on 64-bit > architectures, > which causes a harmless gcc warning: > >>> ... > static void hix5hd2_port_disable(struct hix5hd2_priv *priv) > { > - writel_relaxed(~(BITS_RX_EN | BITS_TX_EN), priv->base + PORT_EN); > + writel_relaxed(~(u32)(BITS_RX_EN | BITS_TX_EN), priv->base + > PORT_EN); > writel_relaxed(0, priv->base + DESC_WR_RD_ENA); > >>> > >>> ISTM that just means that the constants shouldn't be 'long'. > >> > >> Right, but that would probably mean changing the BIT() macro or not using > >> it > >> here. In the past I've argued against using that macro, but I've given > >> up that fight. > > > > Fight on... (Somebody must have gone to USC here) > > > > There might be value in aefin BIT_U32 macro. > > Maybe BIT_U64 too. > > There's BIT_ULL() already. I know, but symmetry is good. I think there'd be no harm in adding it. Perhaps adding all the sized variants would be useful. Something like: #define BIT_OF_TYPE(type, nr) \ ({ \ typeof(type) rtn; \ BUILD_BUG_ON(__builtin_constant_p(nr) &&\ ((nr) < 0 || \ (nr) >= sizeof(type) * BITS_PER_BYTE)); \ rtn = ((type)1) << (nr);\ rtn;\ }) #define BIT_U8(nr) BIT_OF_TYPE(u8, nr) #define BIT_U16(nr) BIT_OF_TYPE(u16, nr) #define BIT_U32(nr) BIT_OF_TYPE(u32, nr) #define BIT_U64(nr) BIT_OF_TYPE(u64, nr) -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
On 10/16/15 12:53 PM, Eric W. Biederman wrote: Alexei Starovoitov writes: On 10/16/15 11:41 AM, Eric W. Biederman wrote: [...] I am missing something. When I suggested using a filesystem it was my thought there would be exactly one superblock per map, and the map would be specified at mount time. You clearly are not implementing that. I don't think it's practical to have sb per map, since that would mean sb per prog and that won't scale. What do you mean won't scale? You want to have a name per map/prog so the basic complexity appears the same. Is there some crucial interaction between the persistent dodads you are placing on a filesystem that I am missing? Given the fact you don't normally need any persistence without a program I am puzzled why "scaling" is an issue of any kind. This is for a comparitively rare case if I am not mistaken. representing map as a directory tree with files as keys is indeed 'rare' since it's mainly for debugging and slow accesses, but 'pin_fd' functionality now popping up everywhere. Mainly because in things like openstack there are tons of disjoint libraries written in different languages and the only thing common is kernel. So pin_fd/new_fd is a mandatory feature. Also map today is an fd that belongs to a process. I cannot see an api from C program to do 'mount of FD' that wouldn't look like ugly hack. mount -t bpffs ... -o fd=1234 That is not all convoluted or hacky. Especially compared to some of the alternatives I am seeing. It is no problem at all to wrap something like that in a nice function call that has the exact same complexity of use as any of the other options that are being explored to give something that starts out as a filedescriptor a name. Frankly, I don't think parsing 'fd=1234' string is a clean api, but before we argue about fs philosophy of passing options, let's get on the same page with requirements. First goal that this patch is solving is providing an ability to 'pin' an FD, so that map/prog won't disappear when user app exist. Second goal of future patches is to expose map internals as a directory structure. These two goals are independent. We can argue about api for 2nd, whether it's mount with fd=1234 string or else, but for the first mount style doesn't make sense. A filesystem per map makes sense as you have a key-value store with one file per key. The idea is that something resembling your bpf_pin_fd function would be the mount system call for the filesystem. The the keys in the map could be read by "ls /mountpoint/". Key values could be inspected with "cat /mountpoint/key". yes. that is still the goal for follow up patches, but contained within given bpffs. Something bpf_pin_fd-like command for bpf syscall would create files for keys in a map and allow 'cat' via open/read. Such api would be much cleaner from C app point of view. Potentially we can allow mount of a file created via BPF_PIN_FD that will expand into keys/values. All of that are our future plans. There, actually, the main contention point is 'how to represent keys and values'. whether key is hex representation or we need some pretty-printers via format string or via schema? etc, etc. We tried few ideas of representing keys in our fuse implementations, but don't have an agreement yet. My gut feel would be to keep it simple and use the same representation you use in your existing system calls. Certainly ordinary filenames are keys of arbitrary binary data that can included everything except a '\0' byte. That they are human readable is a nice convention, but not at all fundamental to what they are. that doesn't work. map keys are never human readable. they're arbitrary binary data. That's why representing them as file name is not trivial. Some pretty-printer is needed. Again that is 2nd goal of bpffs in general. We cannot really solve it now, because we cannot say 'lets represent keys like X and work from there', since that will become kernel ABI and we won't be able to change that. It's equally not clear that thousands of keys can even work as files. So quite a bit of brainstorming still to do for this 2nd goal. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] net/phy: micrel: Follow attached_dev to get to the MAC device
Commit "8b63ec1837fa phylib: Make PHYs children of their MDIO bus, not the bus' parent." broke finding PHY properties in the MAC device tree node. The parent device is now the MDIO bus, not the MAC. Use attached_dev towards the MAC device tree node. Fixes: 8b63ec1837fa ("phylib: Make PHYs children of their MDIO bus, not the bus' parent.") Signed-off-by: Andrew Lunn --- Compile tested only. Dinh, please could you test it and report back if it works or not. drivers/net/phy/micrel.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/net/phy/micrel.c b/drivers/net/phy/micrel.c index 499185eaf413..9e2083a1a5a9 100644 --- a/drivers/net/phy/micrel.c +++ b/drivers/net/phy/micrel.c @@ -25,6 +25,7 @@ #include #include #include +#include #include #include #include @@ -340,8 +341,9 @@ static int ksz9021_config_init(struct phy_device *phydev) const struct device *dev = &phydev->dev; const struct device_node *of_node = dev->of_node; - if (!of_node && dev->parent->of_node) - of_node = dev->parent->of_node; + if (!of_node && phydev->attached_dev && + phydev->attached_dev->dev.of_node) + of_node = phydev->attached_dev->dev.of_node; if (of_node) { ksz9021_load_values_from_of(phydev, of_node, -- 2.5.1 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net-next 1/2] drivers: net: xgene: Add support RGMII TX/RX delay configuration
Add RGMII TX/RX delay configuration support. RGMII standard requires 2ns delay to help the RGMII bridge receiver to sample data correctly. If the default value does not provide proper centering of the data sample, the TX/RX delay parameters can be used to adjust accordingly. Signed-off-by: Iyappan Subramanian --- drivers/net/ethernet/apm/xgene/xgene_enet_hw.c | 8 +++- drivers/net/ethernet/apm/xgene/xgene_enet_hw.h | 1 + drivers/net/ethernet/apm/xgene/xgene_enet_main.c | 49 drivers/net/ethernet/apm/xgene/xgene_enet_main.h | 2 + 4 files changed, 59 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c b/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c index 652f218..33850a0 100644 --- a/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c +++ b/drivers/net/ethernet/apm/xgene/xgene_enet_hw.c @@ -461,6 +461,7 @@ static void xgene_gmac_reset(struct xgene_enet_pdata *pdata) static void xgene_gmac_init(struct xgene_enet_pdata *pdata) { + struct device *dev = &pdata->pdev->dev; u32 value, mc2; u32 intf_ctl, rgmii; u32 icm0, icm2; @@ -490,7 +491,12 @@ static void xgene_gmac_init(struct xgene_enet_pdata *pdata) default: ENET_INTERFACE_MODE2_SET(&mc2, 2); intf_ctl |= ENET_GHD_MODE; - CFG_TXCLK_MUXSEL0_SET(&rgmii, 4); + + if (dev->of_node) { + CFG_TXCLK_MUXSEL0_SET(&rgmii, pdata->tx_delay); + CFG_RXCLK_MUXSEL0_SET(&rgmii, pdata->rx_delay); + } + xgene_enet_rd_csr(pdata, DEBUG_REG_ADDR, &value); value |= CFG_BYPASS_UNISEC_TX | CFG_BYPASS_UNISEC_RX; xgene_enet_wr_csr(pdata, DEBUG_REG_ADDR, value); diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_hw.h b/drivers/net/ethernet/apm/xgene/xgene_enet_hw.h index ff05bbc..6dee73c 100644 --- a/drivers/net/ethernet/apm/xgene/xgene_enet_hw.h +++ b/drivers/net/ethernet/apm/xgene/xgene_enet_hw.h @@ -144,6 +144,7 @@ enum xgene_enet_rm { #define CFG_BYPASS_UNISEC_RX BIT(1) #define CFG_CLE_BYPASS_EN0 BIT(31) #define CFG_TXCLK_MUXSEL0_SET(dst, val)xgene_set_bits(dst, val, 29, 3) +#define CFG_RXCLK_MUXSEL0_SET(dst, val)xgene_set_bits(dst, val, 26, 3) #define CFG_CLE_IP_PROTOCOL0_SET(dst, val) xgene_set_bits(dst, val, 16, 2) #define CFG_CLE_DSTQID0_SET(dst, val) xgene_set_bits(dst, val, 0, 12) diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_main.c b/drivers/net/ethernet/apm/xgene/xgene_enet_main.c index 6b1846d..ce10687 100644 --- a/drivers/net/ethernet/apm/xgene/xgene_enet_main.c +++ b/drivers/net/ethernet/apm/xgene/xgene_enet_main.c @@ -1118,6 +1118,47 @@ static int xgene_get_port_id_dt(struct device *dev, struct xgene_enet_pdata *pda return ret; } +static int xgene_get_tx_delay(struct xgene_enet_pdata *pdata) +{ + struct device *dev = &pdata->pdev->dev; + int delay, ret; + + ret = of_property_read_u32(dev->of_node, "tx-delay", &delay); + if (ret) { + pdata->tx_delay = 4; + return 0; + } + + if (delay < 0 || delay > 7) { + dev_err(dev, "Invalid tx-delay specified\n"); + return -EINVAL; + } + + pdata->tx_delay = delay; + + return 0; +} + +static int xgene_get_rx_delay(struct xgene_enet_pdata *pdata) +{ + struct device *dev = &pdata->pdev->dev; + int delay, ret; + + ret = of_property_read_u32(dev->of_node, "rx-delay", &delay); + if (ret) { + pdata->rx_delay = 2; + return 0; + } + + if (delay < 0 || delay > 7) { + dev_err(dev, "Invalid rx-delay specified\n"); + return -EINVAL; + } + + pdata->rx_delay = delay; + + return 0; +} static int xgene_enet_get_resources(struct xgene_enet_pdata *pdata) { @@ -1194,6 +1235,14 @@ static int xgene_enet_get_resources(struct xgene_enet_pdata *pdata) return -ENODEV; } + ret = xgene_get_tx_delay(pdata); + if (ret) + return ret; + + ret = xgene_get_rx_delay(pdata); + if (ret) + return ret; + ret = platform_get_irq(pdev, 0); if (ret <= 0) { dev_err(dev, "Unable to get ENET Rx IRQ\n"); diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_main.h b/drivers/net/ethernet/apm/xgene/xgene_enet_main.h index ff89a5d..a6e56b8 100644 --- a/drivers/net/ethernet/apm/xgene/xgene_enet_main.h +++ b/drivers/net/ethernet/apm/xgene/xgene_enet_main.h @@ -184,6 +184,8 @@ struct xgene_enet_pdata { u8 bp_bufnum; u16 ring_num; u32 mss; + u8 tx_delay; + u8 rx_delay; }; struct xgene_indirect_ctl { -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://v
[PATCH net-next 0/2] drivers: xgene: Add support RGMII TX/RX delay configuration
This patch adds support RGMII TX/RX delay configuration. Signed-off-by: Iyappan Subramanian --- Iyappan Subramanian (2): drivers: net: xgene: Add support RGMII TX/RX delay configuration Documentation: dts: xgene: Add TX/RX delay field .../devicetree/bindings/net/apm-xgene-enet.txt | 10 + drivers/net/ethernet/apm/xgene/xgene_enet_hw.c | 8 +++- drivers/net/ethernet/apm/xgene/xgene_enet_hw.h | 1 + drivers/net/ethernet/apm/xgene/xgene_enet_main.c | 49 ++ drivers/net/ethernet/apm/xgene/xgene_enet_main.h | 2 + 5 files changed, 69 insertions(+), 1 deletion(-) -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net-next 2/2] Documentation: dts: xgene: Add TX/RX delay field
Signed-off-by: Iyappan Subramanian --- Documentation/devicetree/bindings/net/apm-xgene-enet.txt | 10 ++ 1 file changed, 10 insertions(+) diff --git a/Documentation/devicetree/bindings/net/apm-xgene-enet.txt b/Documentation/devicetree/bindings/net/apm-xgene-enet.txt index f55aa28..5b17c13 100644 --- a/Documentation/devicetree/bindings/net/apm-xgene-enet.txt +++ b/Documentation/devicetree/bindings/net/apm-xgene-enet.txt @@ -37,6 +37,14 @@ Required properties for ethernet interfaces that have external PHY: Optional properties: - status: Should be "ok" or "disabled" for enabled/disabled. Default is "ok". +- tx-delay: Delay value for RGMII TXD timming. + Valid values are between 0 to 7, that maps to + 417, 717, 1020, 1321, 1611, 1913, 2215, 2514 ps + Default value is 4, which corresponds to 1611 ps +- rx-delay: Delay value for RGMII RXD timming in ps. + Valid values are between 0 to 7, that maps to + 273, 589, 899, 1222, 1480, 1806, 2147, 2464 ps + Default value is 2, which corresponds to 899 ps Example: menetclk: menetclk { @@ -72,5 +80,7 @@ Example: /* Board-specific peripheral configurations */ &menet { + tx-delay = <4>; + rx-delay = <2>; status = "ok"; }; -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: SoCFPGA ethernet broken
On Fri, Oct 16, 2015 at 01:24:16PM -0700, David Daney wrote: > On 10/16/2015 12:38 PM, Andrew Lunn wrote: > >>>Maybe we need to walk up the hierarchy. > >>> > >>>Perhaps something like: > >>> > >>>const struct device *dev_walker; > >>> > >>>dev_walker = &phydev->dev; > >>>do { > >>>of_node = dev_walker->of_node; > >>>dev_walker = dev_walker->parent; > >>>} while (!of_node && dev_walker); > >>> > >> > >>The above code seems to have fixed the issue. > > > >What i don't like about this is that it allows you to put these > >properties in the mdio device node. These are phy properties, not mdio > >properties > > Yes, we know that the device tree is not correctly formed. > > You have a choice: > > A) Fix the device tree and any code that may have to change to work > with a good device tree. > > B) Change the code to work with the screwy existing device tree. > The above seems to work, other things may also be possible. > > I can't really make any decisions as to what the best way to proceed > is, as I neither have the hardware in question, nor the time to work > on it. Hi hope Dinh will test with phydev->attached_dev->dev->of_node. Andrew -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: SoCFPGA ethernet broken
On 10/16/2015 12:38 PM, Andrew Lunn wrote: Maybe we need to walk up the hierarchy. Perhaps something like: const struct device *dev_walker; dev_walker = &phydev->dev; do { of_node = dev_walker->of_node; dev_walker = dev_walker->parent; } while (!of_node && dev_walker); The above code seems to have fixed the issue. What i don't like about this is that it allows you to put these properties in the mdio device node. These are phy properties, not mdio properties Yes, we know that the device tree is not correctly formed. You have a choice: A) Fix the device tree and any code that may have to change to work with a good device tree. B) Change the code to work with the screwy existing device tree. The above seems to work, other things may also be possible. I can't really make any decisions as to what the best way to proceed is, as I neither have the hardware in question, nor the time to work on it. David Daney If phydev->attached_dev->dev->of_node works, that would be my preference. Andrew -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
Alexei Starovoitov writes: > On 10/16/15 11:41 AM, Eric W. Biederman wrote: [...] >> I am missing something. >> >> When I suggested using a filesystem it was my thought there would be >> exactly one superblock per map, and the map would be specified at mount >> time. You clearly are not implementing that. > > I don't think it's practical to have sb per map, since that would mean > sb per prog and that won't scale. What do you mean won't scale? You want to have a name per map/prog so the basic complexity appears the same. Is there some crucial interaction between the persistent dodads you are placing on a filesystem that I am missing? Given the fact you don't normally need any persistence without a program I am puzzled why "scaling" is an issue of any kind. This is for a comparitively rare case if I am not mistaken. > Also map today is an fd that belongs to a process. I cannot see > an api from C program to do 'mount of FD' that wouldn't look like > ugly hack. mount -t bpffs ... -o fd=1234 That is not all convoluted or hacky. Especially compared to some of the alternatives I am seeing. It is no problem at all to wrap something like that in a nice function call that has the exact same complexity of use as any of the other options that are being explored to give something that starts out as a filedescriptor a name. >> A filesystem per map makes sense as you have a key-value store with one >> file per key. >> >> The idea is that something resembling your bpf_pin_fd function would be >> the mount system call for the filesystem. >> >> The the keys in the map could be read by "ls /mountpoint/". >> Key values could be inspected with "cat /mountpoint/key". > > yes. that is still the goal for follow up patches, but contained > within given bpffs. Something bpf_pin_fd-like command for bpf syscall > would create files for keys in a map and allow 'cat' via open/read. > Such api would be much cleaner from C app point of view. > Potentially we can allow mount of a file created via BPF_PIN_FD > that will expand into keys/values. > All of that are our future plans. > There, actually, the main contention point is 'how to represent keys > and values'. whether key is hex representation or we need some > pretty-printers via format string or via schema? etc, etc. > We tried few ideas of representing keys in our fuse implementations, > but don't have an agreement yet. My gut feel would be to keep it simple and use the same representation you use in your existing system calls. Certainly ordinary filenames are keys of arbitrary binary data that can included everything except a '\0' byte. That they are human readable is a nice convention, but not at all fundamental to what they are. Eric -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net-next] tcp: do not set queue_mapping on SYNACK
From: Eric Dumazet At the time of commit fff326990789 ("tcp: reflect SYN queue_mapping into SYNACK packets") we had little ways to cope with SYN floods. We no longer need to reflect incoming skb queue mappings, and instead can pick a TX queue based on cpu cooking the SYNACK, with normal XPS affinities. Note that all SYNACK retransmits were picking TX queue 0, this no longer is a win given that SYNACK rtx are now distributed on all cpus. Signed-off-by: Eric Dumazet --- include/net/tcp.h |2 +- net/ipv4/ip_output.c |1 - net/ipv4/tcp_input.c |4 ++-- net/ipv4/tcp_ipv4.c |2 -- net/ipv4/tcp_output.c |2 +- net/ipv6/tcp_ipv6.c |2 -- 6 files changed, 4 insertions(+), 9 deletions(-) diff --git a/include/net/tcp.h b/include/net/tcp.h index a6be56d5f0e3..eed94fc355c1 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -1716,7 +1716,7 @@ struct tcp_request_sock_ops { __u32 (*init_seq)(const struct sk_buff *skb); int (*send_synack)(const struct sock *sk, struct dst_entry *dst, struct flowi *fl, struct request_sock *req, - u16 queue_mapping, struct tcp_fastopen_cookie *foc, + struct tcp_fastopen_cookie *foc, bool attach_req); }; diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index 67404e1fe7d4..50e29737b584 100644 --- a/net/ipv4/ip_output.c +++ b/net/ipv4/ip_output.c @@ -1596,7 +1596,6 @@ void ip_send_unicast_reply(struct sock *sk, struct sk_buff *skb, arg->csumoffset) = csum_fold(csum_add(nskb->csum, arg->csum)); nskb->ip_summed = CHECKSUM_NONE; - skb_set_queue_mapping(nskb, skb_get_queue_mapping(skb)); ip_push_pending_frames(sk, &fl4); } out: diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 3b35c3f4d268..944eaca69115 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -6236,7 +6236,7 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops, } if (fastopen_sk) { af_ops->send_synack(fastopen_sk, dst, &fl, req, - skb_get_queue_mapping(skb), &foc, false); + &foc, false); /* Add the child socket directly into the accept queue */ inet_csk_reqsk_queue_add(sk, req, fastopen_sk); sk->sk_data_ready(sk); @@ -6247,7 +6247,7 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops, if (!want_cookie) inet_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT); af_ops->send_synack(sk, dst, &fl, req, - skb_get_queue_mapping(skb), &foc, !want_cookie); + &foc, !want_cookie); if (want_cookie) goto drop_and_free; } diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index 9c68cf3762c4..30dd45c1f568 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -821,7 +821,6 @@ static void tcp_v4_reqsk_send_ack(const struct sock *sk, struct sk_buff *skb, static int tcp_v4_send_synack(const struct sock *sk, struct dst_entry *dst, struct flowi *fl, struct request_sock *req, - u16 queue_mapping, struct tcp_fastopen_cookie *foc, bool attach_req) { @@ -839,7 +838,6 @@ static int tcp_v4_send_synack(const struct sock *sk, struct dst_entry *dst, if (skb) { __tcp_v4_send_check(skb, ireq->ir_loc_addr, ireq->ir_rmt_addr); - skb_set_queue_mapping(skb, queue_mapping); err = ip_build_and_send_pkt(skb, sk, ireq->ir_loc_addr, ireq->ir_rmt_addr, ireq->opt); diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 6e79fcb0addb..19adedb8c5cc 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -3518,7 +3518,7 @@ int tcp_rtx_synack(const struct sock *sk, struct request_sock *req) int res; tcp_rsk(req)->txhash = net_tx_rndhash(); - res = af_ops->send_synack(sk, NULL, &fl, req, 0, NULL, true); + res = af_ops->send_synack(sk, NULL, &fl, req, NULL, true); if (!res) { TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_RETRANSSEGS); NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPSYNRETRANS); diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c index acb06f86f372..f495d189f5e0 100644 --- a/net/ipv6/tcp_ipv6.c +++ b/net/ipv6/tcp_ipv6.c @@ -437,7 +437,6 @@ out: static int tcp_v6_send_synack(const struct sock *sk, struct dst_entry *dst, struct flowi *fl, struct request_so
Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
On 10/16/2015 09:27 PM, Alexei Starovoitov wrote: On 10/16/15 11:41 AM, Eric W. Biederman wrote: Daniel Borkmann writes: On 10/16/2015 07:42 PM, Alexei Starovoitov wrote: On 10/16/15 10:21 AM, Hannes Frederic Sowa wrote: Another question: Should multiple mount of the filesystem result in an empty fs (a new instance) or in one were one can see other ebpf-fs entities? I think Daniel wanted to already use the mountpoint as some kind of hierarchy delimiter. I would have used directories for that and multiple mounts would then have resulted in the same content of the filesystem. IMHO this would remove some ambiguity but then the question arises how this is handled in a namespaced environment. Was there some specific reason to do so? That's an interesting question! I think all mounts should be independent. I can see tracing using one and networking using another one with different hierarchies suitable for their own use cases. What's an advantage to have the same content everywhere? Feels harder to manage, since different users would need to coordinate. I initially had it as a mount_single() file system, where I was thinking to have an entry under /sys/fs/bpf/, so all subsystems would work on top of that mount point, but for the same reasons above I lifted that restriction. I am missing something. When I suggested using a filesystem it was my thought there would be exactly one superblock per map, and the map would be specified at mount time. You clearly are not implementing that. I don't think it's practical to have sb per map, since that would mean sb per prog and that won't scale. Also map today is an fd that belongs to a process. I cannot see an api from C program to do 'mount of FD' that wouldn't look like ugly hack. A filesystem per map makes sense as you have a key-value store with one file per key. The idea is that something resembling your bpf_pin_fd function would be the mount system call for the filesystem. The the keys in the map could be read by "ls /mountpoint/". Key values could be inspected with "cat /mountpoint/key". yes. that is still the goal for follow up patches, but contained within given bpffs. Something bpf_pin_fd-like command for bpf syscall would create files for keys in a map and allow 'cat' via open/read. Such api would be much cleaner from C app point of view. Potentially we can allow mount of a file created via BPF_PIN_FD that will expand into keys/values. Yeah, sort of making this an optional debugging facility if anything (maybe to just get a read-only snapshot view). Having maps with a very large number of entries might end up being problematic by its own, or mapping potential future map candidates such as rhashtable. There, actually, the main contention point is 'how to represent keys and values'. whether key is hex representation or we need some pretty-printers via format string or via schema? etc, etc. We tried few ideas of representing keys in our fuse implementations, but don't have an agreement yet. That is unclear as well to make it useful. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: SoCFPGA ethernet broken
> > Maybe we need to walk up the hierarchy. > > > > Perhaps something like: > > > > const struct device *dev_walker; > > > > dev_walker = &phydev->dev; > > do { > >of_node = dev_walker->of_node; > >dev_walker = dev_walker->parent; > > } while (!of_node && dev_walker); > > > > The above code seems to have fixed the issue. What i don't like about this is that it allows you to put these properties in the mdio device node. These are phy properties, not mdio properties If phydev->attached_dev->dev->of_node works, that would be my preference. Andrew -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
On 10/16/15 11:41 AM, Eric W. Biederman wrote: Daniel Borkmann writes: On 10/16/2015 07:42 PM, Alexei Starovoitov wrote: On 10/16/15 10:21 AM, Hannes Frederic Sowa wrote: Another question: Should multiple mount of the filesystem result in an empty fs (a new instance) or in one were one can see other ebpf-fs entities? I think Daniel wanted to already use the mountpoint as some kind of hierarchy delimiter. I would have used directories for that and multiple mounts would then have resulted in the same content of the filesystem. IMHO this would remove some ambiguity but then the question arises how this is handled in a namespaced environment. Was there some specific reason to do so? That's an interesting question! I think all mounts should be independent. I can see tracing using one and networking using another one with different hierarchies suitable for their own use cases. What's an advantage to have the same content everywhere? Feels harder to manage, since different users would need to coordinate. I initially had it as a mount_single() file system, where I was thinking to have an entry under /sys/fs/bpf/, so all subsystems would work on top of that mount point, but for the same reasons above I lifted that restriction. I am missing something. When I suggested using a filesystem it was my thought there would be exactly one superblock per map, and the map would be specified at mount time. You clearly are not implementing that. I don't think it's practical to have sb per map, since that would mean sb per prog and that won't scale. Also map today is an fd that belongs to a process. I cannot see an api from C program to do 'mount of FD' that wouldn't look like ugly hack. A filesystem per map makes sense as you have a key-value store with one file per key. The idea is that something resembling your bpf_pin_fd function would be the mount system call for the filesystem. The the keys in the map could be read by "ls /mountpoint/". Key values could be inspected with "cat /mountpoint/key". yes. that is still the goal for follow up patches, but contained within given bpffs. Something bpf_pin_fd-like command for bpf syscall would create files for keys in a map and allow 'cat' via open/read. Such api would be much cleaner from C app point of view. Potentially we can allow mount of a file created via BPF_PIN_FD that will expand into keys/values. All of that are our future plans. There, actually, the main contention point is 'how to represent keys and values'. whether key is hex representation or we need some pretty-printers via format string or via schema? etc, etc. We tried few ideas of representing keys in our fuse implementations, but don't have an agreement yet. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net v2] openvswitch: Allocate memory for ovs internal device stats.
On Fri, Oct 16, 2015 at 4:07 AM, James Morse wrote: > "openvswitch: Remove vport stats" removed the per-vport statistics, in > order to use the netdev's statistics fields. > "openvswitch: Fix ovs_vport_get_stats()" fixed the export of these stats > to user-space, by using the provided netdev_ops to collate them - but ovs > internal devices still use an unallocated dev->tstats field to count > packets, which are no longer exported by this api. > > Allocate the dev->tstats field for ovs internal devices, and wire up > ndo_get_stats64 with the original implementation of > ovs_vport_get_stats(). > > On its own, "openvswitch: Fix ovs_vport_get_stats()" fixes the OOPs, > unmasking a full-on panic on arm64: > =%<== > [] internal_dev_recv+0xa8/0x170 [openvswitch] > [] do_output.isra.31+0x60/0x19c [openvswitch] > [] do_execute_actions+0x208/0x11c0 [openvswitch] > [] ovs_execute_actions+0xc8/0x238 [openvswitch] > [] ovs_packet_cmd_execute+0x21c/0x288 [openvswitch] > [] genl_family_rcv_msg+0x1b0/0x310 > [] genl_rcv_msg+0xa4/0xe4 > [] netlink_rcv_skb+0xb0/0xdc > [] genl_rcv+0x38/0x50 > [] netlink_unicast+0x164/0x210 > [] netlink_sendmsg+0x304/0x368 > [] sock_sendmsg+0x30/0x4c > [SNIP] > Kernel panic - not syncing: Fatal exception in interrupt > =%<== > > Fixes: 8c876639c985 ("openvswitch: Remove vport stats.") > Signed-off-by: James Morse > > --- > Hi netdev, > > "openvswitch: Fix ovs_vport_get_stats()" [0], is an incomplete fix for the > issue in "openvswitch: Remove vport stats.". Use of an unallocated > dev->tstats remains for ovs internal devices, which causes a panic on arm64. > > Change since v1: > * Removed superfluous rx/tx_dropped increment, already done by > dev_get_stats(). > > Thanks! > > James Morse > > [0] https://patchwork.ozlabs.org/patch/525827/ > > net/openvswitch/vport-internal_dev.c | 40 > +++- > 1 file changed, 39 insertions(+), 1 deletion(-) > > diff --git a/net/openvswitch/vport-internal_dev.c > b/net/openvswitch/vport-internal_dev.c > index 388b8a6bf112..957fe715b544 100644 > --- a/net/openvswitch/vport-internal_dev.c > +++ b/net/openvswitch/vport-internal_dev.c > @@ -106,12 +106,45 @@ static void internal_dev_destructor(struct net_device > *dev) > free_netdev(dev); > } > ... > static const struct net_device_ops internal_dev_netdev_ops = { > .ndo_open = internal_dev_open, > .ndo_stop = internal_dev_stop, > .ndo_start_xmit = internal_dev_xmit, > .ndo_set_mac_address = eth_mac_addr, > .ndo_change_mtu = internal_dev_change_mtu, > + .ndo_get_stats64 = internal_get_stats, > }; > > static struct rtnl_link_ops internal_dev_link_ops __read_mostly = { > @@ -161,6 +194,11 @@ static struct vport *internal_dev_create(const struct > vport_parms *parms) > err = -ENOMEM; > goto error_free_vport; > } > + vport->dev->tstats = netdev_alloc_pcpu_stats(struct pcpu_sw_netstats); > + if (!vport->dev->tstats) { > + err = -ENOMEM; > + goto error_free_vport; > + } > Exception handling code is not correct, needs to free netdev here. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: SoCFPGA ethernet broken
On Fri, 16 Oct 2015, David Daney wrote: > On 10/16/2015 08:56 AM, Andrew Lunn wrote: > > > So I think I'll move to inspect what Florian had suggested, and that was > > > to look > > > at: > > > drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.c::stmmac_mdio_register > > > > I have a suspicion. If you look at the phy driver it does: > > > > static int ksz9021_config_init(struct phy_device *phydev) > > { > > const struct device *dev = &phydev->dev; > > const struct device_node *of_node = dev->of_node; > > > > if (!of_node && dev->parent->of_node) > > of_node = dev->parent->of_node; > > > > Maybe we need to walk up the hierarchy. > > Perhaps something like: > > const struct device *dev_walker; > > dev_walker = &phydev->dev; > do { >of_node = dev_walker->of_node; >dev_walker = dev_walker->parent; > } while (!of_node && dev_walker); > The above code seems to have fixed the issue. > An alternative would be to assign the bus the same of_node as the bus parent. > > If either approach works, you can add: > Acked-by: David Daney > > to the patch that implements it. > > > > > In your case, you don't have a phy node in your device tree, so of_node > > is NULL. So it looks in the parent device. > > > > phylib: Make PHYs children of their MDIO bus, not the bus' parent. > > > > changed what the parent is. It is now the mdio device. Before, i > > suspect it was the MAC. Hence it found your properties in the MAC > > node. > > > > What i think you might want to do is change this code. Rather than > > look a dev->parent->of_node; you might want > > phydev->attached_dev->dev->of_node. > > > > This assumes the phy has been attached to the MAC. I've no idea of the > > ordering, so maybe it has not been attached yet? > > > > dp83867.c has similar code. However quick grep did not find any > > mainline users with properties in the MAC node. If that is true, i > > would suggest removing the code looking in the parent for that phy > > driver. > > > > Andrew > > > > BR, Dinh -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] net: hisilicon: add OF dependency
On Saturday 17 October 2015 02:39:51 kbuild test robot wrote: > > [auto build test WARNING on net-next/master -- if it's inappropriate base, > please suggest rules for selecting the more suitable base] > > url: > https://github.com/0day-ci/linux/commits/Arnd-Bergmann/net-hisilicon-add-OF-dependency/20151016-173818 > config: ia64-allmodconfig (attached as .config) > reproduce: > wget > https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross > -O ~/bin/make.cross > chmod +x ~/bin/make.cross > # save the attached .config to linux build tree > make.cross ARCH=ia64 > I guess I should have sent my three patches as a series. The build bot found the same two bugs I had found myself and sent fixes for, but applying the 'add OF dependency' without the other two makes them appear in more configurations. For all I can see, the two reports here are fixed if all three of my patches are applied. Arnd -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHv2 net 3/3] openvswitch: Serialize nested ct actions if provided
On Fri, Oct 16, 2015 at 11:08 AM, Joe Stringer wrote: > If userspace provides a ct action with no nested mark or label, then the > storage for these fields is zeroed. Later when actions are requested, > such zeroed fields are serialized even though userspace didn't > originally specify them. Fix the behaviour by ensuring that no action is > serialized in this case, and reject actions where userspace attempts to > set these fields with mask=0. This should make netlink marshalling > consistent across deserialization/reserialization. > > Reported-by: Jarno Rajahalme > Signed-off-by: Joe Stringer Acked-by: Pravin B Shelar -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
Daniel Borkmann writes: > On 10/16/2015 07:42 PM, Alexei Starovoitov wrote: >> On 10/16/15 10:21 AM, Hannes Frederic Sowa wrote: >>> Another question: >>> Should multiple mount of the filesystem result in an empty fs (a new >>> instance) or in one were one can see other ebpf-fs entities? I think >>> Daniel wanted to already use the mountpoint as some kind of hierarchy >>> delimiter. I would have used directories for that and multiple mounts >>> would then have resulted in the same content of the filesystem. IMHO >>> this would remove some ambiguity but then the question arises how this >>> is handled in a namespaced environment. Was there some specific reason >>> to do so? >> >> That's an interesting question! >> I think all mounts should be independent. >> I can see tracing using one and networking using another one >> with different hierarchies suitable for their own use cases. >> What's an advantage to have the same content everywhere? >> Feels harder to manage, since different users would need to >> coordinate. > > I initially had it as a mount_single() file system, where I was thinking > to have an entry under /sys/fs/bpf/, so all subsystems would work on top > of that mount point, but for the same reasons above I lifted that restriction. I am missing something. When I suggested using a filesystem it was my thought there would be exactly one superblock per map, and the map would be specified at mount time. You clearly are not implementing that. A filesystem per map makes sense as you have a key-value store with one file per key. The idea is that something resembling your bpf_pin_fd function would be the mount system call for the filesystem. The the keys in the map could be read by "ls /mountpoint/". Key values could be inspected with "cat /mountpoint/key". That allows all hierarchy etc to be handled in userspace, just as with my files for namespaces. I do not understand why you have presented to userspace a magic filesystem that you allow binding to. That is not what I intended to suggest and I do not know how that makes any sense. Eric -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] net: hix5hd2_gmac: avoid integer overload warning
On 10/16/2015 09:04 PM, Joe Perches wrote: BITS_RX_EN is an 'unsigned long' constant, so the ones complement of that has bits set that do not fit into a 32-bit variable on 64-bit architectures, which causes a harmless gcc warning: ... static void hix5hd2_port_disable(struct hix5hd2_priv *priv) { - writel_relaxed(~(BITS_RX_EN | BITS_TX_EN), priv->base + PORT_EN); + writel_relaxed(~(u32)(BITS_RX_EN | BITS_TX_EN), priv->base + PORT_EN); writel_relaxed(0, priv->base + DESC_WR_RD_ENA); ISTM that just means that the constants shouldn't be 'long'. Right, but that would probably mean changing the BIT() macro or not using it here. In the past I've argued against using that macro, but I've given up that fight. Fight on... (Somebody must have gone to USC here) There might be value in a BIT_U32 macro. Maybe BIT_U64 too. There's BIT_ULL() already. MBR, Sergei -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHv2 net] openvswitch: Scrub skb between namespaces
On Fri, Oct 16, 2015 at 11:08 AM, Joe Stringer wrote: > If OVS receives a packet from another namespace, then the packet should > be scrubbed. However, people have already begun to rely on the behaviour > that skb->mark is preserved across namespaces, so retain this one field. > > This is mainly to address information leakage between namespaces when > using OVS internal ports, but by placing it in ovs_vport_receive() it is > more generally applicable, meaning it should not be overlooked if other > port types are allowed to be moved into namespaces in future. > > Signed-off-by: Joe Stringer Acked-by: Pravin B Shelar -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] net: hisilicon: add OF dependency
Hi Arnd, [auto build test WARNING on net-next/master -- if it's inappropriate base, please suggest rules for selecting the more suitable base] url: https://github.com/0day-ci/linux/commits/Arnd-Bergmann/net-hisilicon-add-OF-dependency/20151016-173818 config: ia64-allmodconfig (attached as .config) reproduce: wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # save the attached .config to linux build tree make.cross ARCH=ia64 All warnings (new ones prefixed by >>): In file included from arch/ia64/include/asm/smp.h:20:0, from include/linux/smp.h:59, from include/linux/topology.h:33, from include/linux/gfp.h:8, from include/linux/kmod.h:22, from include/linux/module.h:13, from drivers/net/ethernet/hisilicon/hix5hd2_gmac.c:10: drivers/net/ethernet/hisilicon/hix5hd2_gmac.c: In function 'hix5hd2_port_disable': arch/ia64/include/asm/io.h:398:38: warning: large integer implicitly truncated to unsigned type [-Woverflow] #define writel_relaxed(v,a) __writel((v), (a)) ^ >> drivers/net/ethernet/hisilicon/hix5hd2_gmac.c:374:2: note: in expansion of >> macro 'writel_relaxed' writel_relaxed(~(BITS_RX_EN | BITS_TX_EN), priv->base + PORT_EN); ^ vim +/writel_relaxed +374 drivers/net/ethernet/hisilicon/hix5hd2_gmac.c 57c5bc9a Zhangfei Gao 2014-06-03 358 writel_relaxed(DEF_INT_MASK, priv->base + ENA_PMU_INT); 57c5bc9a Zhangfei Gao 2014-06-03 359 } 57c5bc9a Zhangfei Gao 2014-06-03 360 57c5bc9a Zhangfei Gao 2014-06-03 361 static void hix5hd2_irq_disable(struct hix5hd2_priv *priv) 57c5bc9a Zhangfei Gao 2014-06-03 362 { 57c5bc9a Zhangfei Gao 2014-06-03 363 writel_relaxed(0, priv->base + ENA_PMU_INT); 57c5bc9a Zhangfei Gao 2014-06-03 364 } 57c5bc9a Zhangfei Gao 2014-06-03 365 57c5bc9a Zhangfei Gao 2014-06-03 366 static void hix5hd2_port_enable(struct hix5hd2_priv *priv) 57c5bc9a Zhangfei Gao 2014-06-03 367 { 57c5bc9a Zhangfei Gao 2014-06-03 368 writel_relaxed(0xf, priv->base + DESC_WR_RD_ENA); 57c5bc9a Zhangfei Gao 2014-06-03 369 writel_relaxed(BITS_RX_EN | BITS_TX_EN, priv->base + PORT_EN); 57c5bc9a Zhangfei Gao 2014-06-03 370 } 57c5bc9a Zhangfei Gao 2014-06-03 371 57c5bc9a Zhangfei Gao 2014-06-03 372 static void hix5hd2_port_disable(struct hix5hd2_priv *priv) 57c5bc9a Zhangfei Gao 2014-06-03 373 { 57c5bc9a Zhangfei Gao 2014-06-03 @374 writel_relaxed(~(BITS_RX_EN | BITS_TX_EN), priv->base + PORT_EN); 57c5bc9a Zhangfei Gao 2014-06-03 375 writel_relaxed(0, priv->base + DESC_WR_RD_ENA); 57c5bc9a Zhangfei Gao 2014-06-03 376 } 57c5bc9a Zhangfei Gao 2014-06-03 377 57c5bc9a Zhangfei Gao 2014-06-03 378 static void hix5hd2_hw_set_mac_addr(struct net_device *dev) 57c5bc9a Zhangfei Gao 2014-06-03 379 { 57c5bc9a Zhangfei Gao 2014-06-03 380 struct hix5hd2_priv *priv = netdev_priv(dev); 57c5bc9a Zhangfei Gao 2014-06-03 381 unsigned char *mac = dev->dev_addr; 57c5bc9a Zhangfei Gao 2014-06-03 382 u32 val; :: The code at line 374 was first introduced by commit :: 57c5bc9ad7d799e9507ba6e993398d2c55f03fab net: hisilicon: add hix5hd2 mac driver :: TO: Zhangfei Gao :: CC: David S. Miller --- 0-DAY kernel test infrastructureOpen Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation .config.gz Description: Binary data
Re: [PATCH] net: hisilicon: add OF dependency
On Sat, 2015-10-17 at 02:16 +0800, kbuild test robot wrote: > All errors (new ones prefixed by >>): >drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c: In function > 'hns_dsaf_get_cfg': > >> drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c:151:3: error: implicit > >> declaration of function 'iounmap' [-Werror=implicit-function-declaration] > iounmap(dsaf_dev->io_base); > ^ >cc1: some warnings being treated as errors > > vim +/iounmap +151 drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c [] > 511e6bc0 huangdaode 2015-09-17 142 if > (!dma_set_mask_and_coherent(dsaf_dev->dev, DMA_BIT_MASK(64ULL))) btw: this should be DMA_BIT_MASK(64) without the ULL -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] net: hisilicon: add OF dependency
Hi Arnd, [auto build test ERROR on net-next/master -- if it's inappropriate base, please suggest rules for selecting the more suitable base] url: https://github.com/0day-ci/linux/commits/Arnd-Bergmann/net-hisilicon-add-OF-dependency/20151016-173818 config: um-allyesconfig (attached as .config) reproduce: # save the attached .config to linux build tree make ARCH=um All errors (new ones prefixed by >>): drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c: In function 'hns_dsaf_get_cfg': >> drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c:151:3: error: implicit >> declaration of function 'iounmap' [-Werror=implicit-function-declaration] iounmap(dsaf_dev->io_base); ^ cc1: some warnings being treated as errors vim +/iounmap +151 drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c 511e6bc0 huangdaode 2015-09-17 135 dsaf_dev->buf_size_type = hns_rcb_buf_size2type(buf_size); 511e6bc0 huangdaode 2015-09-17 136 if (dsaf_dev->buf_size_type < 0) { 511e6bc0 huangdaode 2015-09-17 137 dev_err(dsaf_dev->dev, 511e6bc0 huangdaode 2015-09-17 138 "buf_size(%d) is wrong!\n", buf_size); 511e6bc0 huangdaode 2015-09-17 139 goto unmap_base_addr; 511e6bc0 huangdaode 2015-09-17 140 } 511e6bc0 huangdaode 2015-09-17 141 511e6bc0 huangdaode 2015-09-17 142 if (!dma_set_mask_and_coherent(dsaf_dev->dev, DMA_BIT_MASK(64ULL))) 511e6bc0 huangdaode 2015-09-17 143 dev_dbg(dsaf_dev->dev, "set mask to 64bit\n"); 511e6bc0 huangdaode 2015-09-17 144 else 511e6bc0 huangdaode 2015-09-17 145 dev_err(dsaf_dev->dev, "set mask to 64bit fail!\n"); 511e6bc0 huangdaode 2015-09-17 146 511e6bc0 huangdaode 2015-09-17 147 return 0; 511e6bc0 huangdaode 2015-09-17 148 511e6bc0 huangdaode 2015-09-17 149 unmap_base_addr: 511e6bc0 huangdaode 2015-09-17 150 if (dsaf_dev->io_base) 511e6bc0 huangdaode 2015-09-17 @151 iounmap(dsaf_dev->io_base); 511e6bc0 huangdaode 2015-09-17 152 if (dsaf_dev->ppe_base) 511e6bc0 huangdaode 2015-09-17 153 iounmap(dsaf_dev->ppe_base); 511e6bc0 huangdaode 2015-09-17 154 if (dsaf_dev->sds_base) 511e6bc0 huangdaode 2015-09-17 155 iounmap(dsaf_dev->sds_base); 511e6bc0 huangdaode 2015-09-17 156 if (dsaf_dev->sc_base) 511e6bc0 huangdaode 2015-09-17 157 iounmap(dsaf_dev->sc_base); 511e6bc0 huangdaode 2015-09-17 158 if (dsaf_dev->cpld_base) 511e6bc0 huangdaode 2015-09-17 159 iounmap(dsaf_dev->cpld_base); :: The code at line 151 was first introduced by commit :: 511e6bc071db1484d1a3d1d0bd4c244cf33910ff net: add Hisilicon Network Subsystem DSAF support :: TO: huangdaode :: CC: David S. Miller --- 0-DAY kernel test infrastructureOpen Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation .config.gz Description: Binary data
Re: SoCFPGA ethernet broken
On 16/10/15 08:56, Andrew Lunn wrote: >> So I think I'll move to inspect what Florian had suggested, and that was to >> look >> at: drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.c::stmmac_mdio_register > > I have a suspicion. If you look at the phy driver it does: > > static int ksz9021_config_init(struct phy_device *phydev) > { > const struct device *dev = &phydev->dev; > const struct device_node *of_node = dev->of_node; > > if (!of_node && dev->parent->of_node) > of_node = dev->parent->of_node; > > > In your case, you don't have a phy node in your device tree, so of_node > is NULL. So it looks in the parent device. > > phylib: Make PHYs children of their MDIO bus, not the bus' parent. > > changed what the parent is. It is now the mdio device. Before, i > suspect it was the MAC. Hence it found your properties in the MAC > node. > > What i think you might want to do is change this code. Rather than > look a dev->parent->of_node; you might want > phydev->attached_dev->dev->of_node. Yes, that would work, by the time config_init() executes, you are guaranteed to have a valid attached_dev pointer. > > This assumes the phy has been attached to the MAC. I've no idea of the > ordering, so maybe it has not been attached yet? > > dp83867.c has similar code. However quick grep did not find any > mainline users with properties in the MAC node. If that is true, i > would suggest removing the code looking in the parent for that phy > driver. I think we should fix these drivers to look for these properties where they were used to, as a short term plan, and then make sure that their binding document is very specific about the fact that these properties, are by definition, properties of the Ethernet PHY, not the Ethernet MAC. -- Florian -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCHv2 net] openvswitch: Scrub skb between namespaces
If OVS receives a packet from another namespace, then the packet should be scrubbed. However, people have already begun to rely on the behaviour that skb->mark is preserved across namespaces, so retain this one field. This is mainly to address information leakage between namespaces when using OVS internal ports, but by placing it in ovs_vport_receive() it is more generally applicable, meaning it should not be overlooked if other port types are allowed to be moved into namespaces in future. Signed-off-by: Joe Stringer --- v2: Add unlikely(), shift all within the netns check block. --- net/openvswitch/vport.c | 9 + 1 file changed, 9 insertions(+) diff --git a/net/openvswitch/vport.c b/net/openvswitch/vport.c index fc5c0b9ccfe9..12a36ac21eda 100644 --- a/net/openvswitch/vport.c +++ b/net/openvswitch/vport.c @@ -444,6 +444,15 @@ int ovs_vport_receive(struct vport *vport, struct sk_buff *skb, OVS_CB(skb)->input_vport = vport; OVS_CB(skb)->mru = 0; + if (unlikely(dev_net(skb->dev) != ovs_dp_get_net(vport->dp))) { + u32 mark; + + mark = skb->mark; + skb_scrub_packet(skb, true); + skb->mark = mark; + tun_info = NULL; + } + /* Extract flow from 'skb' into 'key'. */ error = ovs_flow_key_extract(tun_info, skb, &key); if (unlikely(error)) { -- 2.1.4 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCHv2 net 2/3] openvswitch: Treat IP_CT_RELATED as new
New, related connections are marked as such as part of ovs_ct_lookup(), but they are not marked as "new" if the commit flag is used. Make this consistent by treating IP_CT_RELATED as new as well. Reported-by: Jarno Rajahalme Signed-off-by: Joe Stringer Acked-by: Pravin B Shelar --- v2: Acked. --- net/openvswitch/conntrack.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c index 80bf702715bb..480dbb9095b7 100644 --- a/net/openvswitch/conntrack.c +++ b/net/openvswitch/conntrack.c @@ -86,6 +86,8 @@ static u8 ovs_ct_get_state(enum ip_conntrack_info ctinfo) ct_state |= OVS_CS_F_ESTABLISHED; break; case IP_CT_RELATED: + ct_state |= OVS_CS_F_NEW; + /* Fall through */ case IP_CT_RELATED_REPLY: ct_state |= OVS_CS_F_RELATED; break; -- 2.1.4 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCHv2 net 1/3] openvswitch: Reject ct_state masks for unknown bits
Currently, 0-bits are generated in ct_state where the bit position is undefined, and matches are accepted on these bit-positions. If userspace requests to match the 0-value for this bit then it may expect only a subset of traffic to match this value, whereas currently all packets will have this bit set to 0. Fix this by rejecting such masks. Signed-off-by: Joe Stringer Acked-by: Pravin B Shelar --- v2: Remove extraneous ovs_ct_supported() function declaration. Acked. --- net/openvswitch/conntrack.h| 16 +--- net/openvswitch/flow_netlink.c | 5 - 2 files changed, 9 insertions(+), 12 deletions(-) diff --git a/net/openvswitch/conntrack.h b/net/openvswitch/conntrack.h index da8714942c95..82e0dfc66028 100644 --- a/net/openvswitch/conntrack.h +++ b/net/openvswitch/conntrack.h @@ -35,12 +35,9 @@ void ovs_ct_fill_key(const struct sk_buff *skb, struct sw_flow_key *key); int ovs_ct_put_key(const struct sw_flow_key *key, struct sk_buff *skb); void ovs_ct_free_action(const struct nlattr *a); -static inline bool ovs_ct_state_supported(u32 state) -{ - return !(state & ~(OVS_CS_F_NEW | OVS_CS_F_ESTABLISHED | -OVS_CS_F_RELATED | OVS_CS_F_REPLY_DIR | -OVS_CS_F_INVALID | OVS_CS_F_TRACKED)); -} +#define CT_SUPPORTED_MASK (OVS_CS_F_NEW | OVS_CS_F_ESTABLISHED | \ + OVS_CS_F_RELATED | OVS_CS_F_REPLY_DIR | \ + OVS_CS_F_INVALID | OVS_CS_F_TRACKED) #else #include @@ -53,11 +50,6 @@ static inline bool ovs_ct_verify(struct net *net, int attr) return false; } -static inline bool ovs_ct_state_supported(u32 state) -{ - return false; -} - static inline int ovs_ct_copy_action(struct net *net, const struct nlattr *nla, const struct sw_flow_key *key, struct sw_flow_actions **acts, bool log) @@ -94,5 +86,7 @@ static inline int ovs_ct_put_key(const struct sw_flow_key *key, } static inline void ovs_ct_free_action(const struct nlattr *a) { } + +#define CT_SUPPORTED_MASK 0 #endif /* CONFIG_NF_CONNTRACK */ #endif /* ovs_conntrack.h */ diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c index 171a691f1c32..bd710bc37469 100644 --- a/net/openvswitch/flow_netlink.c +++ b/net/openvswitch/flow_netlink.c @@ -816,7 +816,7 @@ static int metadata_from_nlattrs(struct net *net, struct sw_flow_match *match, ovs_ct_verify(net, OVS_KEY_ATTR_CT_STATE)) { u32 ct_state = nla_get_u32(a[OVS_KEY_ATTR_CT_STATE]); - if (!is_mask && !ovs_ct_state_supported(ct_state)) { + if (ct_state & ~CT_SUPPORTED_MASK) { OVS_NLERR(log, "ct_state flags %08x unsupported", ct_state); return -EINVAL; @@ -1099,6 +1099,9 @@ static void nlattr_set(struct nlattr *attr, u8 val, } else { memset(nla_data(nla), val, nla_len(nla)); } + + if (nla_type(nla) == OVS_KEY_ATTR_CT_STATE) + *(u32 *)nla_data(nla) &= CT_SUPPORTED_MASK; } } -- 2.1.4 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCHv2 net 3/3] openvswitch: Serialize nested ct actions if provided
If userspace provides a ct action with no nested mark or label, then the storage for these fields is zeroed. Later when actions are requested, such zeroed fields are serialized even though userspace didn't originally specify them. Fix the behaviour by ensuring that no action is serialized in this case, and reject actions where userspace attempts to set these fields with mask=0. This should make netlink marshalling consistent across deserialization/reserialization. Reported-by: Jarno Rajahalme Signed-off-by: Joe Stringer --- v2: Reuse labels_nonzero(). Don't check for labels support during execution. --- net/openvswitch/conntrack.c | 14 ++ 1 file changed, 10 insertions(+), 4 deletions(-) diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c index 480dbb9095b7..b0287f13491a 100644 --- a/net/openvswitch/conntrack.c +++ b/net/openvswitch/conntrack.c @@ -224,9 +224,6 @@ static int ovs_ct_set_labels(struct sk_buff *skb, struct sw_flow_key *key, struct nf_conn *ct; int err; - if (!IS_ENABLED(CONFIG_NF_CONNTRACK_LABELS)) - return -ENOTSUPP; - /* The connection could be invalid, in which case set_label is no-op.*/ ct = nf_ct_get(skb, &ctinfo); if (!ct) @@ -589,6 +586,10 @@ static int parse_ct(const struct nlattr *attr, struct ovs_conntrack_info *info, case OVS_CT_ATTR_MARK: { struct md_mark *mark = nla_data(a); + if (!mark->mask) { + OVS_NLERR(log, "ct_mark mask cannot be 0"); + return -EINVAL; + } info->mark = *mark; break; } @@ -597,6 +598,10 @@ static int parse_ct(const struct nlattr *attr, struct ovs_conntrack_info *info, case OVS_CT_ATTR_LABELS: { struct md_labels *labels = nla_data(a); + if (!labels_nonzero(&labels->mask)) { + OVS_NLERR(log, "ct_labels mask cannot be 0"); + return -EINVAL; + } info->labels = *labels; break; } @@ -707,11 +712,12 @@ int ovs_ct_action_to_attr(const struct ovs_conntrack_info *ct_info, if (IS_ENABLED(CONFIG_NF_CONNTRACK_ZONES) && nla_put_u16(skb, OVS_CT_ATTR_ZONE, ct_info->zone.id)) return -EMSGSIZE; - if (IS_ENABLED(CONFIG_NF_CONNTRACK_MARK) && + if (IS_ENABLED(CONFIG_NF_CONNTRACK_MARK) && ct_info->mark.mask && nla_put(skb, OVS_CT_ATTR_MARK, sizeof(ct_info->mark), &ct_info->mark)) return -EMSGSIZE; if (IS_ENABLED(CONFIG_NF_CONNTRACK_LABELS) && + labels_nonzero(&ct_info->labels.mask) && nla_put(skb, OVS_CT_ATTR_LABELS, sizeof(ct_info->labels), &ct_info->labels)) return -EMSGSIZE; -- 2.1.4 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] net: hix5hd2_gmac: avoid integer overload warning
On Fri, 2015-10-16 at 13:28 +0200, Arnd Bergmann wrote: > On Friday 16 October 2015 11:14:44 David Laight wrote: > > From: Arnd Bergmann > > > Sent: 16 October 2015 11:01 > > > BITS_RX_EN is an 'unsigned long' constant, so the ones complement of that > > > has bits set that do not fit into a 32-bit variable on 64-bit > > > architectures, > > > which causes a harmless gcc warning: > > ... > > > static void hix5hd2_port_disable(struct hix5hd2_priv *priv) > > > { > > > - writel_relaxed(~(BITS_RX_EN | BITS_TX_EN), priv->base + PORT_EN); > > > + writel_relaxed(~(u32)(BITS_RX_EN | BITS_TX_EN), priv->base + > > > PORT_EN); > > > writel_relaxed(0, priv->base + DESC_WR_RD_ENA); > > > > ISTM that just means that the constants shouldn't be 'long'. > > Right, but that would probably mean changing the BIT() macro or not using it > here. In the past I've argued against using that macro, but I've given > up that fight. Fight on... (Somebody must have gone to USC here) There might be value in a BIT_U32 macro. Maybe BIT_U64 too. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
On 10/16/2015 07:42 PM, Alexei Starovoitov wrote: On 10/16/15 10:21 AM, Hannes Frederic Sowa wrote: Another question: Should multiple mount of the filesystem result in an empty fs (a new instance) or in one were one can see other ebpf-fs entities? I think Daniel wanted to already use the mountpoint as some kind of hierarchy delimiter. I would have used directories for that and multiple mounts would then have resulted in the same content of the filesystem. IMHO this would remove some ambiguity but then the question arises how this is handled in a namespaced environment. Was there some specific reason to do so? That's an interesting question! I think all mounts should be independent. I can see tracing using one and networking using another one with different hierarchies suitable for their own use cases. What's an advantage to have the same content everywhere? Feels harder to manage, since different users would need to coordinate. I initially had it as a mount_single() file system, where I was thinking to have an entry under /sys/fs/bpf/, so all subsystems would work on top of that mount point, but for the same reasons above I lifted that restriction. Cheers, Daniel -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
On 10/16/15 10:21 AM, Hannes Frederic Sowa wrote: Another question: Should multiple mount of the filesystem result in an empty fs (a new instance) or in one were one can see other ebpf-fs entities? I think Daniel wanted to already use the mountpoint as some kind of hierarchy delimiter. I would have used directories for that and multiple mounts would then have resulted in the same content of the filesystem. IMHO this would remove some ambiguity but then the question arises how this is handled in a namespaced environment. Was there some specific reason to do so? That's an interesting question! I think all mounts should be independent. I can see tracing using one and networking using another one with different hierarchies suitable for their own use cases. What's an advantage to have the same content everywhere? Feels harder to manage, since different users would need to coordinate. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
On 10/16/15 at 10:32am, Alexei Starovoitov wrote: > On 10/16/15 9:43 AM, Hannes Frederic Sowa wrote: > >Oh, tracing does not allow daemons. Why? I can only imagine embedded > >users, no? > > yes and for networking: restartability and HA. > cannot really do that with fuse/daemons. Right, the smaller the footprint, the better. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
On 10/16/15 10:27 AM, Daniel Borkmann wrote: but don't know how flexible we are in terms of adding S_IFBPF to the UAPI. I don't think it should be a problem. You referred to POSIX Standard in your other mail but I can't see any reason why not to establish a new file mode. Anyway, FreeBSD (e.g. whiteouts) and Solaris (e.g. Doors, Event Ports) are just examples of new modes being added. mknod /bpf/map/1 m 1 1 :) Yes, maybe I think this is a better solution architectural instead of constructing a new filesystem. Yeah, also 'man 2 stat' lists a couple of others used by various systems. The pro's of this approach would be that no new file system would be needed and the special inode could be placed on top of any 'regular' file system that would support special files. I do like that as well. I don't like it at all for the reasons you've just stated: 'it will prevent us doing shell style access to such files' I'm wondering whether this would prevent us in future from opening access to shell tools etc on that special file, but probably one could provide a default set of file ops via init_special_inode() that could be overloaded by the underlying fs if required. and also because adding new S_ISSOCK-like bit for bpf feel as a begining of nightmare, since sooner or later all filesystems would need to have a check for it like they have for sock type. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
On 10/16/15 9:43 AM, Hannes Frederic Sowa wrote: Hi Alexei, On Fri, Oct 16, 2015, at 18:18, Alexei Starovoitov wrote: On 10/16/15 3:25 AM, Hannes Frederic Sowa wrote: Namespaces at some point dealt with the same problem, they nowadays use bind mounts of/proc/$$/ns/* to some place in the file hierarchy to keep the namespace alive. This at least allows someone to build up its own hierarchy with normal unix tools and not hidden inside a C-program. For filedescriptors we already have/proc/$$/fd/* but it seems that doesn't work out of the box nowadays. bind mounting of /proc/../fd was initially proposed by Andy and we've looked at it thoroughly, but after discussion with Eric it became apparent that it doesn't fit here. At the end we need shell tools to access maps. Oh yes, I want shell tools for this very much! Maybe even that things like strings, grep etc. work. :) yes and the only way to get there is to have it done via fs. Also I think you missed the hierarchy in this patch set _is_ built with normal 'mkdir' and files are removed with 'rm'. I did not miss that, I am just concerned that if the kernel does not enforce such a hierarchy automatically it won't really happen. if it's easier for user to work with single level of directories, it should be able to do so. It's not a job of the kernel to enforce how user space apps should be designed. Oh, tracing does not allow daemons. Why? I can only imagine embedded users, no? yes and for networking: restartability and HA. cannot really do that with fuse/daemons. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
On 10/16/2015 06:36 PM, Hannes Frederic Sowa wrote: On Fri, Oct 16, 2015, at 15:36, Daniel Borkmann wrote: On 10/16/2015 12:25 PM, Hannes Frederic Sowa wrote: On Fri, Oct 16, 2015, at 03:09, Daniel Borkmann wrote: This eventually leads us to this patch, which implements a minimal eBPF file system. The idea is a bit similar, but to the point that these inodes reside at one or multiple mount points. A directory hierarchy can be tailored to a specific application use-case from the various subsystem users and maps/progs pinned inside it. Two new eBPF commands (BPF_PIN_FD, BPF_NEW_FD) have been added to the syscall in order to create one or multiple special inodes from an existing file descriptor that points to a map/program (we call it eBPF fd pinning), or to create a new file descriptor from an existing special inode. BPF_PIN_FD requires CAP_SYS_ADMIN capabilities, whereas BPF_NEW_FD can also be done unpriviledged when having appropriate permissions to the path. In my opinion this is very un-unixiy, I have to say at least. Namespaces at some point dealt with the same problem, they nowadays use bind mounts of /proc/$$/ns/* to some place in the file hierarchy to keep the namespace alive. This at least allows someone to build up its own hierarchy with normal unix tools and not hidden inside a C-program. For filedescriptors we already have /proc/$$/fd/* but it seems that doesn't work out of the box nowadays. Yes, that doesn't work out of the box, but I also don't know how usable that would really be. The idea is roughly rather similar to the paths passed to bind(2)/connect(2) on Unix domain sockets, as mentioned. You have a map/prog resource that you stick to a special inode so that you can retrieve it at a later point in time from the same or different processes through a new fd pointing to the resource from user side, so that the bpf(2) syscall can be performed upon it. With Unix tools, you could still create/remove a hierarchy or unlink those that have maps/progs. You are correct that tools that don't implement bpf(2) currently cannot access the content behind it, since bpf(2) manages access to the data itself. I did like the 2nd idea though, mentioned in the commit log, but don't know how flexible we are in terms of adding S_IFBPF to the UAPI. I don't think it should be a problem. You referred to POSIX Standard in your other mail but I can't see any reason why not to establish a new file mode. Anyway, FreeBSD (e.g. whiteouts) and Solaris (e.g. Doors, Event Ports) are just examples of new modes being added. mknod /bpf/map/1 m 1 1 :) Yes, maybe I think this is a better solution architectural instead of constructing a new filesystem. Yeah, also 'man 2 stat' lists a couple of others used by various systems. The pro's of this approach would be that no new file system would be needed and the special inode could be placed on top of any 'regular' file system that would support special files. I do like that as well. I'm wondering whether this would prevent us in future from opening access to shell tools etc on that special file, but probably one could provide a default set of file ops via init_special_inode() that could be overloaded by the underlying fs if required. I don't know in terms of how many objects bpf should be able to handle and if such a bind-mount based solution would work, I guess not. In my opinion I still favor a user space approach. Subsystems which use ebpf in a way that no user space program needs to be running to control them would need to export the fds by itself. E.g. something like sysfs/kobject for tc? The hierarchy would then be in control of the subsystem which could also create a proper naming hierarchy or maybe even use an already given one. Do most other eBPF users really need to persist file descriptors somewhere without user space control and pick them up later? I was thinking about a strict predefined hierarchy dictated by the kernel as well, but was then considering a more flexible approach that could be tailored freely to various use cases. A predefined hierarchy would most likely need to be resolved per subsystem and it's not really easy to map this properly. F.e. if the kernel would try to provide unique ids (as opposed to have a name or annotation member through the syscall), it could end up being quite cryptic. If we let the users choose names, I'm not sure if a single hierarchy level would be enough. Then, additionally you have facilities like tail calls that eBPF programs could do. I don't think that most subsystems need to expose those file descriptors. Seccomp probably will have a supervisor process running and per aggregation will also have a user space process running keeping the fd alive. So it is all about tc/sched. And I am not sure if tc will really needs a filesystem to handle all this. The simplest approach is to just keep a name <-> fd mapping somewhere in the net/sched/ subsystem and use this for all tc users. Solving this on a generic
Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
On Fri, Oct 16, 2015, at 03:09, Daniel Borkmann wrote: > This eventually leads us to this patch, which implements a minimal > eBPF file system. The idea is a bit similar, but to the point that > these inodes reside at one or multiple mount points. A directory > hierarchy can be tailored to a specific application use-case from the > various subsystem users and maps/progs pinned inside it. Two new eBPF > commands (BPF_PIN_FD, BPF_NEW_FD) have been added to the syscall in > order to create one or multiple special inodes from an existing file > descriptor that points to a map/program (we call it eBPF fd pinning), > or to create a new file descriptor from an existing special inode. > BPF_PIN_FD requires CAP_SYS_ADMIN capabilities, whereas BPF_NEW_FD > can also be done unpriviledged when having appropriate permissions > to the path. > Another question: Should multiple mount of the filesystem result in an empty fs (a new instance) or in one were one can see other ebpf-fs entities? I think Daniel wanted to already use the mountpoint as some kind of hierarchy delimiter. I would have used directories for that and multiple mounts would then have resulted in the same content of the filesystem. IMHO this would remove some ambiguity but then the question arises how this is handled in a namespaced environment. Was there some specific reason to do so? Thanks, Hannes -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch net-next v5 3/8] switchdev: allow caller to explicitly request attr_set as deferred
On 15-10-16 09:20 AM, Scott Feldman wrote: > On Fri, Oct 16, 2015 at 1:23 AM, Jiri Pirko wrote: >> Thu, Oct 15, 2015 at 05:21:22PM CEST, john.fastab...@gmail.com wrote: >>> On 15-10-14 10:40 AM, Jiri Pirko wrote: From: Jiri Pirko Caller should know if he can call attr_set directly (when holding RTNL) or if he has to defer the att_set processing for later. This also allows drivers to sleep inside attr_set and report operation status back to switchdev core. Switchdev core then warns if status is not ok, instead of silent errors happening in drivers. Benefit from newly introduced switchdev deferred ops infrastructure. >>> >>> A nit but the patch description should note your setting the defer bit >>> on the bridge set state. >>> Signed-off-by: Jiri Pirko --- include/net/switchdev.h | 1 + net/bridge/br_stp.c | 3 +- net/switchdev/switchdev.c | 108 ++ 3 files changed, 46 insertions(+), 66 deletions(-) diff --git a/include/net/switchdev.h b/include/net/switchdev.h index d1c7f90..f7de6f8 100644 --- a/include/net/switchdev.h +++ b/include/net/switchdev.h @@ -17,6 +17,7 @@ #define SWITCHDEV_F_NO_RECURSE BIT(0) #define SWITCHDEV_F_SKIP_EOPNOTSUPP BIT(1) +#define SWITCHDEV_F_DEFER BIT(2) struct switchdev_trans_item { struct list_head list; diff --git a/net/bridge/br_stp.c b/net/bridge/br_stp.c index db6d243de..80c34d7 100644 --- a/net/bridge/br_stp.c +++ b/net/bridge/br_stp.c @@ -41,13 +41,14 @@ void br_set_state(struct net_bridge_port *p, unsigned int state) { struct switchdev_attr attr = { .id = SWITCHDEV_ATTR_ID_PORT_STP_STATE, +.flags = SWITCHDEV_F_DEFER, .u.stp_state = state, }; >>> >>> >>> This creates a possible race (with 6/8) I think, please check! >> >> Wait. This patch does not change the previous behaviour. Patch 6 does, >> so I don't understand why you are asking here. Confusing. >> >> >>> >>> In del_nbp() we call br_stp_disable_port() to set the port state >>> to BR_STATE_DISABLE and disabling learning events. But with this >>> patch it can be deferred. Also note the STP agent may be in userspace >>> which actually seems more likely the case because you likely want to >>> run some more modern variant of STP than the kernel supports. >>> >>> So at some point in the future the driver will turn off learning. At >>> the same time we call br_fdb_delete_by_port which calls a deferred >>> set of fdb deletes. >>> >>> I don't see how you guarantee learning is off before you start doing >>> the deletes here and possibly learning new addresses after the software >>> side believes the port is down. >>> >>> So >>> >>> br_stp_disable_port >>> br_fdb_delete_by_port >>> {fdb_del_external_learn} >>> [hw learns a fdb] >>> [hw disables learning] >>> >>> What stops this from happening? >> >> Okay. This behaviour is the same as without the patchset. What would >> resolve the issue it to put switchdev_deferred_process() after >> br_stp_disable_port() and before br_fdb_delete_by_port() call. >> That would enforce stp change to happen in hw before fdbs are explicitly >> deleted. Sound good to you? > > Doesn't HW already see things in the right order since items are > dequeued from the deferred list in the order queued? > I just noticed there is the rtnl lock around the deferred process so this should mean only a single deferred task is run at a time. OK sort of a big hammer though. +static void switchdev_deferred_process_work(struct work_struct *work) +{ + rtnl_lock(); + switchdev_deferred_process(); + rtnl_unlock(); +} But I don't see how this block of code works, spin_lock_bh(&br->lock); br_stp_disable_port(p); spin_unlock_bh(&br->lock); br_ifinfo_notify(RTM_DELLINK, p); list_del_rcu(&p->list); nbp_vlan_flush(p); br_fdb_delete_by_port(br, p, 0, 1); switchdev_deferred_process(); If you defer disabling learning then setup a bunch of deletes also deferred, what guarantees that learning was disabled before you setup the deferred fdb delete list? It seems you may call rocker_port_fdb_learn_work() after the br_fdb_delete_by_port call above but before the deferred learn disable call which would have disabled it. Also the rocker_port_fdb_learn_work queue is on a lw->work and doesn't use the same locking? Its sort of an interesting hieararchy of work queues being built up here. I'm being a bit overly paranoid at this point because I've seen some really ugly bugs around these types of things that are difficult to spot because they are correctness bugs instead of hard crashes. .John -- To unsubscribe from this list: send the line "
Re: [PATCH nf-next 1/6] netfilter-ipv4: Line layout whitespace fixes
On Wed, Oct 14, 2015 at 11:17:03PM +0100, Ian Morris wrote: > Cleanses some whitespace issues by removing a leading space before a tab. > > No changes detected by objdiff. > > Signed-off-by: Ian Morris > --- > net/ipv4/netfilter/ipt_ECN.c | 2 +- > net/ipv4/netfilter/nf_nat_pptp.c | 2 +- > net/ipv4/netfilter/nf_nat_snmp_basic.c | 2 +- > 3 files changed, 3 insertions(+), 3 deletions(-) > > diff --git a/net/ipv4/netfilter/ipt_ECN.c b/net/ipv4/netfilter/ipt_ECN.c > index 2707652..6592708 100644 > --- a/net/ipv4/netfilter/ipt_ECN.c > +++ b/net/ipv4/netfilter/ipt_ECN.c > @@ -24,7 +24,7 @@ MODULE_AUTHOR("Harald Welte "); > MODULE_DESCRIPTION("Xtables: Explicit Congestion Notification (ECN) flag > modification"); > > /* set ECT codepoint from IP header. > - * return false if there was an error. */ > + * return false if there was an error. */ In netdev coding style we prefer for multiline comments: /* This is a comment blah blah blah blah blah blah blah blah blah blah blah blah * blah blah blah. */ For single line: /* This is a comment blah blah blah */ This case, I suggest we can even get rid of that comment there since it's obvious what set_ect_ip() is doing by reading the function name and looking at what it returns. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH nf-next 0/6] coding style improvements: netfilter-ipv4
On Wed, Oct 14, 2015 at 11:17:02PM +0100, Ian Morris wrote: > This series of patches improves the coding style of the netfilter-ipv4 > code by addressing some issues detected by checkpatch. > > The changes were previously submitted as part of a larger monolithic > patch but on advice from Pablo, these are being re-sent in smaller, > more structured batches. Series applied, except patch 1/6 that I have kept back. I'll follow up with an explanation. Thanks. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] net: hisilicon: add OF dependency
Hi Arnd, [auto build test ERROR on net-next/master -- if it's inappropriate base, please suggest rules for selecting the more suitable base] url: https://github.com/0day-ci/linux/commits/Arnd-Bergmann/net-hisilicon-add-OF-dependency/20151016-173818 config: x86_64-allmodconfig reproduce: make ARCH=x86_64 allmodconfig make ARCH=x86_64 All error/warnings (new ones prefixed by >>): vim +/vzalloc +1123 drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c 511e6bc0 huangdaode 2015-09-17 1117 511e6bc0 huangdaode 2015-09-17 1118ret = hns_dsaf_init_hw(dsaf_dev); 511e6bc0 huangdaode 2015-09-17 1119if (ret) 511e6bc0 huangdaode 2015-09-17 1120return ret; 511e6bc0 huangdaode 2015-09-17 1121 511e6bc0 huangdaode 2015-09-17 1122/* malloc mem for tcam mac key(vlan+mac) */ 511e6bc0 huangdaode 2015-09-17 @1123priv->soft_mac_tbl = vzalloc(sizeof(*priv->soft_mac_tbl) 511e6bc0 huangdaode 2015-09-17 1124 * DSAF_TCAM_SUM); 511e6bc0 huangdaode 2015-09-17 1125if (!priv->soft_mac_tbl) { 511e6bc0 huangdaode 2015-09-17 1126ret = -ENOMEM; 511e6bc0 huangdaode 2015-09-17 1127goto remove_hw; 511e6bc0 huangdaode 2015-09-17 1128} 511e6bc0 huangdaode 2015-09-17 1129 511e6bc0 huangdaode 2015-09-17 1130/*all entry invall */ 511e6bc0 huangdaode 2015-09-17 1131for (i = 0; i < DSAF_TCAM_SUM; i++) 511e6bc0 huangdaode 2015-09-17 1132(priv->soft_mac_tbl + i)->index = DSAF_INVALID_ENTRY_IDX; 511e6bc0 huangdaode 2015-09-17 1133 511e6bc0 huangdaode 2015-09-17 1134return 0; 511e6bc0 huangdaode 2015-09-17 1135 511e6bc0 huangdaode 2015-09-17 1136 remove_hw: 511e6bc0 huangdaode 2015-09-17 1137hns_dsaf_remove_hw(dsaf_dev); 511e6bc0 huangdaode 2015-09-17 1138return ret; 511e6bc0 huangdaode 2015-09-17 1139 } 511e6bc0 huangdaode 2015-09-17 1140 511e6bc0 huangdaode 2015-09-17 1141 /** 511e6bc0 huangdaode 2015-09-17 1142 * hns_dsaf_free - free dsa fabric 511e6bc0 huangdaode 2015-09-17 1143 * @dsaf_dev: dsa fabric device struct pointer 511e6bc0 huangdaode 2015-09-17 1144 */ 511e6bc0 huangdaode 2015-09-17 1145 static void hns_dsaf_free(struct dsaf_device *dsaf_dev) 511e6bc0 huangdaode 2015-09-17 1146 { 511e6bc0 huangdaode 2015-09-17 1147struct dsaf_drv_priv *priv = 511e6bc0 huangdaode 2015-09-17 1148(struct dsaf_drv_priv *)hns_dsaf_dev_priv(dsaf_dev); 511e6bc0 huangdaode 2015-09-17 1149 511e6bc0 huangdaode 2015-09-17 1150hns_dsaf_remove_hw(dsaf_dev); 511e6bc0 huangdaode 2015-09-17 1151 511e6bc0 huangdaode 2015-09-17 1152/* free all mac mem */ 511e6bc0 huangdaode 2015-09-17 @1153vfree(priv->soft_mac_tbl); 511e6bc0 huangdaode 2015-09-17 1154priv->soft_mac_tbl = NULL; 511e6bc0 huangdaode 2015-09-17 1155 } 511e6bc0 huangdaode 2015-09-17 1156 :: The code at line 1123 was first introduced by commit :: 511e6bc071db1484d1a3d1d0bd4c244cf33910ff net: add Hisilicon Network Subsystem DSAF support :: TO: huangdaode :: CC: David S. Miller --- 0-DAY kernel test infrastructureOpen Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: SoCFPGA ethernet broken
On 10/16/2015 08:56 AM, Andrew Lunn wrote: So I think I'll move to inspect what Florian had suggested, and that was to look at: drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.c::stmmac_mdio_register I have a suspicion. If you look at the phy driver it does: static int ksz9021_config_init(struct phy_device *phydev) { const struct device *dev = &phydev->dev; const struct device_node *of_node = dev->of_node; if (!of_node && dev->parent->of_node) of_node = dev->parent->of_node; Maybe we need to walk up the hierarchy. Perhaps something like: const struct device *dev_walker; dev_walker = &phydev->dev; do { of_node = dev_walker->of_node; dev_walker = dev_walker->parent; } while (!of_node && dev_walker); An alternative would be to assign the bus the same of_node as the bus parent. If either approach works, you can add: Acked-by: David Daney to the patch that implements it. In your case, you don't have a phy node in your device tree, so of_node is NULL. So it looks in the parent device. phylib: Make PHYs children of their MDIO bus, not the bus' parent. changed what the parent is. It is now the mdio device. Before, i suspect it was the MAC. Hence it found your properties in the MAC node. What i think you might want to do is change this code. Rather than look a dev->parent->of_node; you might want phydev->attached_dev->dev->of_node. This assumes the phy has been attached to the MAC. I've no idea of the ordering, so maybe it has not been attached yet? dp83867.c has similar code. However quick grep did not find any mainline users with properties in the MAC node. If that is true, i would suggest removing the code looking in the parent for that phy driver. Andrew -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
Hi Alexei, On Fri, Oct 16, 2015, at 18:18, Alexei Starovoitov wrote: > On 10/16/15 3:25 AM, Hannes Frederic Sowa wrote: > > Namespaces at some point dealt with the same problem, they nowadays use > > bind mounts of/proc/$$/ns/* to some place in the file hierarchy to keep > > the namespace alive. This at least allows someone to build up its own > > hierarchy with normal unix tools and not hidden inside a C-program. For > > filedescriptors we already have/proc/$$/fd/* but it seems that doesn't > > work out of the box nowadays. > > bind mounting of /proc/../fd was initially proposed by Andy and we've > looked at it thoroughly, but after discussion with Eric it became > apparent that it doesn't fit here. At the end we need shell tools > to access maps. Oh yes, I want shell tools for this very much! Maybe even that things like strings, grep etc. work. :) > Also I think you missed the hierarchy in this patch set _is_ built with > normal 'mkdir' and files are removed with 'rm'. I did not miss that, I am just concerned that if the kernel does not enforce such a hierarchy automatically it won't really happen. > The only thing that C does is BPF_PIN_FD of fd that was received from > bpf syscall. That's way cleaner api than doing bind mount from C > program. I am with you there. Unfortunately we don't have a give "this fd a name" syscalls so far so I totally understand the decision here. > We've considered letting open() of the file return bpf specific > anon-inode, but decided to reserve that for other more natural file > operations. Therefore BPF_NEW_FD is needed. Can't this be overloaded somehow. You can use mknod for creation and open for regular file use. mknod is its own syscall. > > I don't know in terms of how many objects bpf should be able to handle > > and if such a bind-mount based solution would work, I guess not. > > We definitely missed you at the last plumbers where it was discussed :) Yes. :( > > In my opinion I still favor a user space approach. > > that's not acceptable for tracing use cases. No daemons allowed. Oh, tracing does not allow daemons. Why? I can only imagine embedded users, no? > > Subsystems which use > > ebpf in a way that no user space program needs to be running to control > > them would need to export the fds by itself. E.g. something like > > sysfs/kobject for tc? The hierarchy would then be in control of the > > subsystem which could also create a proper naming hierarchy or maybe > > even use an already given one. Do most other eBPF users really need to > > persist file descriptors somewhere without user space control and pick > > them up later? > > I think it's way cleaner to have one way of solving it (like this patch > does) instead of asking every subsystem to solve it differently. > We've also looked at sysfs and it's ugly when it comes to removing, > since the user cannot use normal 'rm'. Ah, okay. Probably it would depend on some tc node always referencing the bpf entity. But I see that sysfs might become too problematic. Bye, Hannes -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] netfilter: turn NF_HOOK into an inline function
On Fri, Oct 09, 2015 at 08:45:42PM +0200, Arnd Bergmann wrote: > A recent change to the dst_output handling caused a new warning > when the call to NF_HOOK() is the only used of a local variable > passed as 'dev', and CONFIG_NETFILTER is disabled: > > net/ipv6/ip6_output.c: In function 'ip6_output': > net/ipv6/ip6_output.c:135:21: warning: unused variable 'dev' > [-Wunused-variable] > > The reason for this is that the NF_HOOK macro in this case does > not reference the variable at all, and the call to dev_net(dev) > got removed from the ip6_output function. To avoid that warning now > and in the future, this changes the macro into an equivalent > inline function, which tells the compiler that the variable is > passed correctly but still unused. > > The dn_forward function apparently had the same problem in > the past and added a local workaround that no longer works > with the inline function. In order to avoid a regression, we > have to also remove the #ifdef from decnet in the same patch. Applied, thanks. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
On Fri, Oct 16, 2015, at 15:36, Daniel Borkmann wrote: > On 10/16/2015 12:25 PM, Hannes Frederic Sowa wrote: > > On Fri, Oct 16, 2015, at 03:09, Daniel Borkmann wrote: > >> This eventually leads us to this patch, which implements a minimal > >> eBPF file system. The idea is a bit similar, but to the point that > >> these inodes reside at one or multiple mount points. A directory > >> hierarchy can be tailored to a specific application use-case from the > >> various subsystem users and maps/progs pinned inside it. Two new eBPF > >> commands (BPF_PIN_FD, BPF_NEW_FD) have been added to the syscall in > >> order to create one or multiple special inodes from an existing file > >> descriptor that points to a map/program (we call it eBPF fd pinning), > >> or to create a new file descriptor from an existing special inode. > >> BPF_PIN_FD requires CAP_SYS_ADMIN capabilities, whereas BPF_NEW_FD > >> can also be done unpriviledged when having appropriate permissions > >> to the path. > > > > In my opinion this is very un-unixiy, I have to say at least. > > > > Namespaces at some point dealt with the same problem, they nowadays use > > bind mounts of /proc/$$/ns/* to some place in the file hierarchy to keep > > the namespace alive. This at least allows someone to build up its own > > hierarchy with normal unix tools and not hidden inside a C-program. For > > filedescriptors we already have /proc/$$/fd/* but it seems that doesn't > > work out of the box nowadays. > > Yes, that doesn't work out of the box, but I also don't know how usable > that would really be. The idea is roughly rather similar to the paths > passed to bind(2)/connect(2) on Unix domain sockets, as mentioned. You > have a map/prog resource that you stick to a special inode so that you > can retrieve it at a later point in time from the same or different > processes through a new fd pointing to the resource from user side, so > that the bpf(2) syscall can be performed upon it. > > With Unix tools, you could still create/remove a hierarchy or unlink > those that have maps/progs. You are correct that tools that don't > implement bpf(2) currently cannot access the content behind it, since > bpf(2) manages access to the data itself. I did like the 2nd idea though, > mentioned in the commit log, but don't know how flexible we are in > terms of adding S_IFBPF to the UAPI. I don't think it should be a problem. You referred to POSIX Standard in your other mail but I can't see any reason why not to establish a new file mode. Anyway, FreeBSD (e.g. whiteouts) and Solaris (e.g. Doors, Event Ports) are just examples of new modes being added. mknod /bpf/map/1 m 1 1 :) Yes, maybe I think this is a better solution architectural instead of constructing a new filesystem. > > I don't know in terms of how many objects bpf should be able to handle > > and if such a bind-mount based solution would work, I guess not. > > > > In my opinion I still favor a user space approach. Subsystems which use > > ebpf in a way that no user space program needs to be running to control > > them would need to export the fds by itself. E.g. something like > > sysfs/kobject for tc? The hierarchy would then be in control of the > > subsystem which could also create a proper naming hierarchy or maybe > > even use an already given one. Do most other eBPF users really need to > > persist file descriptors somewhere without user space control and pick > > them up later? > > I was thinking about a strict predefined hierarchy dictated by the kernel > as well, but was then considering a more flexible approach that could be > tailored freely to various use cases. A predefined hierarchy would most > likely need to be resolved per subsystem and it's not really easy to map > this properly. F.e. if the kernel would try to provide unique ids (as > opposed to have a name or annotation member through the syscall), it > could end up being quite cryptic. If we let the users choose names, I'm > not sure if a single hierarchy level would be enough. Then, additionally > you have facilities like tail calls that eBPF programs could do. I don't think that most subsystems need to expose those file descriptors. Seccomp probably will have a supervisor process running and per aggregation will also have a user space process running keeping the fd alive. So it is all about tc/sched. And I am not sure if tc will really needs a filesystem to handle all this. The simplest approach is to just keep a name <-> fd mapping somewhere in the net/sched/ subsystem and use this for all tc users. Otherwise can we somehow Incorporate this in sysfs directory where we maybe create a kobject per installed filter, something along those lines. I see that tail calls makes this all very difficult to show which entity uses which ebpf entity in some way, as it looks like n:m relationships. > In such cases, one could even craft relationships where a (strict auto > generated) tree representation would not be sufficient (f.e. > recirculati
Re: [PATCH] ipv6: no addrconf for slave devices
On Fri, Oct 16, 2015 at 6:14 PM, David Ahern wrote: > On 10/16/15 10:12 AM, Jan Blunck wrote: >> >> On Fri, Oct 16, 2015 at 6:02 PM, David Ahern >> wrote: >>> >>> On 10/16/15 9:57 AM, Jan Blunck wrote: I don't think that enslaved ports should get network layer addresses. This is one example with a team device: >>> >>> >>> >>> for VRF devices we do want the enslaved links to have link local >>> addresses. >>> >> >> That is interesting. As far I can see you are setting IFF_SLAVE in >> do_vrf_add_slave() and therefore already stop IPv6 addrconf. >> > > Check net-next. That had to be removed to get IPv6 working. > Thanks for the pointer. So it would be better to differentiate between L2 and L3 ports and only start addrconf on later ones? I don't think there is a flag that allows for that though. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] net: phy: smsc: disable energy detect mode
2015-10-13 21:17 GMT-07:00 Heiko Schocher : > Hello Florian, > > > Am 13.10.2015 um 21:26 schrieb Florian Fainelli: >> >> On 12/10/15 22:13, Heiko Schocher wrote: >>> >>> On some boards the energy enable detect mode leads in >>> trouble with some switches, so make the enabling of >>> this mode configurable through DT. >>> >>> Signed-off-by: Heiko Schocher >>> --- >>> >>> .../devicetree/bindings/net/smsc-lan87xx.txt | 19 >>> + >>> drivers/net/phy/smsc.c | 24 >>> +- >>> 2 files changed, 38 insertions(+), 5 deletions(-) >>> create mode 100644 >>> Documentation/devicetree/bindings/net/smsc-lan87xx.txt >>> >>> diff --git a/Documentation/devicetree/bindings/net/smsc-lan87xx.txt >>> b/Documentation/devicetree/bindings/net/smsc-lan87xx.txt >>> new file mode 100644 >>> index 000..39aa1dc >>> --- /dev/null >>> +++ b/Documentation/devicetree/bindings/net/smsc-lan87xx.txt >>> @@ -0,0 +1,19 @@ >>> +SMSC LAN87xx Ethernet PHY >>> + >>> +Some boards require special tuning values. Configure them >>> +through an Ethernet OF device node. >>> + >>> +Optional properties: >>> + >>> +- disable-energy-detect: >>> + If set, do not enable energy detect mode for the SMSC phy. >>> + default: enable energy detect mode >> >> >> Although energy detection is something that is implemented by many PHYs, >> I am not sure a generic property is suitable here, I would prefix that >> with the SMSC vendor prefix here to make it clear this only applies to >> this PHY. > > > Hmm... but all PHYs should be able to enable, disable it in some way, or? It may not always be controlled directly at the PHY level, sometimes this is something that needs cooperation with the Ethernet MAC as well in case of integrated designs. > >> Would not you want to make it a reverse property here though, something >> like this: >> >> smsc,energy-detect: boolean, when present indicates the PHY reliably >> supports energy detection > > > Yes, that was also my first thought, but currently, on this PHYs > energy detect mode is on ... and if I introduce such a property, > it will disable it for all existing boards, because property is > missing ... so, maybe I break boards ... Fair enough, how about smsc,disabled-energy-detect or something like that then? -- Florian -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch net-next v5 3/8] switchdev: allow caller to explicitly request attr_set as deferred
On Fri, Oct 16, 2015 at 1:23 AM, Jiri Pirko wrote: > Thu, Oct 15, 2015 at 05:21:22PM CEST, john.fastab...@gmail.com wrote: >>On 15-10-14 10:40 AM, Jiri Pirko wrote: >>> From: Jiri Pirko >>> >>> Caller should know if he can call attr_set directly (when holding RTNL) >>> or if he has to defer the att_set processing for later. >>> >>> This also allows drivers to sleep inside attr_set and report operation >>> status back to switchdev core. Switchdev core then warns if status is >>> not ok, instead of silent errors happening in drivers. >>> >>> Benefit from newly introduced switchdev deferred ops infrastructure. >>> >> >>A nit but the patch description should note your setting the defer bit >>on the bridge set state. >> >>> Signed-off-by: Jiri Pirko >>> --- >>> include/net/switchdev.h | 1 + >>> net/bridge/br_stp.c | 3 +- >>> net/switchdev/switchdev.c | 108 >>> ++ >>> 3 files changed, 46 insertions(+), 66 deletions(-) >>> >>> diff --git a/include/net/switchdev.h b/include/net/switchdev.h >>> index d1c7f90..f7de6f8 100644 >>> --- a/include/net/switchdev.h >>> +++ b/include/net/switchdev.h >>> @@ -17,6 +17,7 @@ >>> >>> #define SWITCHDEV_F_NO_RECURSE BIT(0) >>> #define SWITCHDEV_F_SKIP_EOPNOTSUPP BIT(1) >>> +#define SWITCHDEV_F_DEFER BIT(2) >>> >>> struct switchdev_trans_item { >>> struct list_head list; >>> diff --git a/net/bridge/br_stp.c b/net/bridge/br_stp.c >>> index db6d243de..80c34d7 100644 >>> --- a/net/bridge/br_stp.c >>> +++ b/net/bridge/br_stp.c >>> @@ -41,13 +41,14 @@ void br_set_state(struct net_bridge_port *p, unsigned >>> int state) >>> { >>> struct switchdev_attr attr = { >>> .id = SWITCHDEV_ATTR_ID_PORT_STP_STATE, >>> +.flags = SWITCHDEV_F_DEFER, >>> .u.stp_state = state, >>> }; >> >> >>This creates a possible race (with 6/8) I think, please check! > > Wait. This patch does not change the previous behaviour. Patch 6 does, > so I don't understand why you are asking here. Confusing. > > >> >>In del_nbp() we call br_stp_disable_port() to set the port state >>to BR_STATE_DISABLE and disabling learning events. But with this >>patch it can be deferred. Also note the STP agent may be in userspace >>which actually seems more likely the case because you likely want to >>run some more modern variant of STP than the kernel supports. >> >>So at some point in the future the driver will turn off learning. At >>the same time we call br_fdb_delete_by_port which calls a deferred >>set of fdb deletes. >> >>I don't see how you guarantee learning is off before you start doing >>the deletes here and possibly learning new addresses after the software >>side believes the port is down. >> >>So >> >> br_stp_disable_port >> br_fdb_delete_by_port >> {fdb_del_external_learn} >> [hw learns a fdb] >> [hw disables learning] >> >>What stops this from happening? > > Okay. This behaviour is the same as without the patchset. What would > resolve the issue it to put switchdev_deferred_process() after > br_stp_disable_port() and before br_fdb_delete_by_port() call. > That would enforce stp change to happen in hw before fdbs are explicitly > deleted. Sound good to you? Doesn't HW already see things in the right order since items are dequeued from the deferred list in the order queued? -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
On 10/16/15 3:25 AM, Hannes Frederic Sowa wrote: Namespaces at some point dealt with the same problem, they nowadays use bind mounts of/proc/$$/ns/* to some place in the file hierarchy to keep the namespace alive. This at least allows someone to build up its own hierarchy with normal unix tools and not hidden inside a C-program. For filedescriptors we already have/proc/$$/fd/* but it seems that doesn't work out of the box nowadays. bind mounting of /proc/../fd was initially proposed by Andy and we've looked at it thoroughly, but after discussion with Eric it became apparent that it doesn't fit here. At the end we need shell tools to access maps. Also I think you missed the hierarchy in this patch set _is_ built with normal 'mkdir' and files are removed with 'rm'. The only thing that C does is BPF_PIN_FD of fd that was received from bpf syscall. That's way cleaner api than doing bind mount from C program. We've considered letting open() of the file return bpf specific anon-inode, but decided to reserve that for other more natural file operations. Therefore BPF_NEW_FD is needed. I don't know in terms of how many objects bpf should be able to handle and if such a bind-mount based solution would work, I guess not. We definitely missed you at the last plumbers where it was discussed :) In my opinion I still favor a user space approach. that's not acceptable for tracing use cases. No daemons allowed. Subsystems which use ebpf in a way that no user space program needs to be running to control them would need to export the fds by itself. E.g. something like sysfs/kobject for tc? The hierarchy would then be in control of the subsystem which could also create a proper naming hierarchy or maybe even use an already given one. Do most other eBPF users really need to persist file descriptors somewhere without user space control and pick them up later? I think it's way cleaner to have one way of solving it (like this patch does) instead of asking every subsystem to solve it differently. We've also looked at sysfs and it's ugly when it comes to removing, since the user cannot use normal 'rm'. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html