Re: [PATCH][xfrm-next] xfrm6: remove BUG_ON from xfrm6_dst_ifdown

2018-11-22 Thread Steffen Klassert
On Mon, Nov 12, 2018 at 05:28:22PM +0800, Li RongQing wrote:
> if loopback_idev is a NULL pointer, the subsequent access of
> loopback_idev will trigger a panic, which is the same as BUG_ON
> 
> Signed-off-by: Li RongQing 

Patch applied, thanks!


Suggesting patch for tcp_close

2018-11-22 Thread 배석진
Dear all,


This is Soukin Bae from the Samsung Electronics Mobile Division.
We have a problem with TCP session closing.

In short:
1. during the 4-way handshake to close a session,
2. if the ACK packet never arrives from the opposite side,
3. then the session can never be closed.


On a mobile device, condition 2 can happen in various cases,
such as turning off Wi-Fi or mobile data, or poor network conditions, etc.

This can occur on either side of the connection.
When the issue happens during an active close, the session remains in the
FIN_WAIT1 state; during a passive close, it remains in the LAST_ACK state.

-
below is the test result after repeatedly toggling Wi-Fi on and off (without
mobile data). The 'Foreign Address' peer presumably sent its FIN-ACK while
Wi-Fi was off, so the device could not receive the ACK packet, and the
session remains permanently. The count of such sessions keeps growing; this
is a resource leak.

### turn on wifi
D:\Test>adb shell netstat -npWae
Proto Recv-Q Send-Q Local Address  Foreign Address  State  User  Inode  PID/Program Name
tcp    0  0 127.0.0.1:5037  0.0.0.0:*  LISTEN  0  36357  6907/adbd
tcp6   0  0 2001:2d8:ed1c:de1c:bd94:fe5:2d9a:d8e4:58660  2404:6800:4008:c00::bc:5228  ESTABLISHED  10041  74347  6523/com.google.android.gms.persistent
tcp6   0  0 2001:2d8:ed1c:de1c:bd94:fe5:2d9a:d8e4:35148  2404:6800:4004:800::2003:80  LAST_ACK  0  0  -
tcp6   0  0 2001:2d8:ed1c:de1c:bd94:fe5:2d9a:d8e4:37512  64:ff9b::3444:f3dc:443  ESTABLISHED  10137  77447  9522/com.samsung.android.game.gos
tcp6   0  0 2001:2d8:ed1c:de1c:bd94:fe5:2d9a:d8e4:49294  2404:6800:4005:80c::2004:443  LAST_ACK  0  0  -
tcp6   1  0 2001:2d8:ed1c:de1c:bd94:fe5:2d9a:d8e4:35260  64:ff9b::34d0:9421:80  LAST_ACK  0  0  -

### turn off wifi
D:\Test>adb shell netstat -npWae
Proto Recv-Q Send-Q Local Address  Foreign Address  State  User  Inode  PID/Program Name
tcp    0  0 127.0.0.1:5037  0.0.0.0:*  LISTEN  0  36357  6907/adbd
tcp6   0  0 2001:2d8:ed1c:de1c:bd94:fe5:2d9a:d8e4:35148  2404:6800:4004:800::2003:80  LAST_ACK  0  0  -
tcp6   0  0 2001:2d8:ed1c:de1c:bd94:fe5:2d9a:d8e4:49294  2404:6800:4005:80c::2004:443  LAST_ACK  0  0  -
tcp6   1  0 2001:2d8:ed1c:de1c:bd94:fe5:2d9a:d8e4:35260  64:ff9b::34d0:9421:80  LAST_ACK  0  0  -


-
this is our analysis:
when an app finishes using the socket (TCP session), it calls sock_close().
tcp_close() then moves sk->sk_state to LAST_ACK and marks the sock SOCK_DEAD
by executing sock_orphan().

11-23 11:40:55.676 [5:  Thread-44:11210] TCP: bsj: tcp_set_state: TCP 
sk=ffc8a789c640, in:80092, State Close Wait -> Last ACK, 
[2404:6800:4004:800::2003]
11-23 11:40:55.676 [5:  Thread-44:11210] Call trace:
11-23 11:40:55.676 [5:  Thread-44:11210] [] 
tcp_set_state+0x1b8/0x1f0
11-23 11:40:55.676 [5:  Thread-44:11210] [] 
tcp_close+0x484/0x534
11-23 11:40:55.676 [5:  Thread-44:11210] [] 
inet_release+0x60/0x74
11-23 11:40:55.676 [5:  Thread-44:11210] [] 
inet6_release+0x30/0x48
11-23 11:40:55.676 [5:  Thread-44:11210] [] 
__sock_release+0x40/0x104
11-23 11:40:55.676 [5:  Thread-44:11210] [] 
sock_close+0x18/0x28
11-23 11:40:55.678 [5:  Thread-44:11210] TCP: bsj: sock_orphan: TCP 
sk=ffc8a789c640, in:80092, State Last ACK, [2404:6800:4004:800::2003]


At this point, if the FIN-ACK arrives, there is no problem; all is well.
But without it, and when Wi-Fi is then turned off,
netd tries to close all the sessions by calling tcp_abort() via sock_diag_destroy().

11-23 11:41:38.463 [4:   netd: 5323] TCP: bsj: tcp_abort: SOCK_DEAD!!! 
: TCP sk=ffc8a789c640, in:0, State Last ACK, caller: , 
[2404:6800:4004:800::2003]
11-23 11:41:38.464 [4:   netd: 5323] TCP: bsj: tcp_abort: SOCK_DEAD!!! 
: TCP sk=ffc8a789b840, in:0, State Last ACK, caller: , 
[2404:6800:4005:80c::2004]
11-23 11:41:38.464 [4:   netd: 5323] TCP: bsj: tcp_abort: SOCK_DEAD!!! 
: TCP sk=ffc8a7899c40, in:0, State Last ACK, caller: , 
[64:ff9b::34d0:9421]

But because this sock was already put into the SOCK_DEAD state by tcp_close(),
tcp_done() cannot be executed,
so this session can never be closed.

int tcp_abort(struct sock *sk, int err)
{
	...
	if (!sock_flag(sk, SOCK_DEAD)) {	/* when SOCK_DEAD, tcp_done() is skipped */
		...
		sk->sk_error_report(sk);
		if (tcp_need_reset(sk->sk_state))
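To make the reported sequence concrete, here is a small userspace toy model (not kernel code; the struct and the toy_* names are invented for illustration): tcp_close() orphans the socket (SOCK_DEAD), and a later tcp_abort() then skips the tcp_done() teardown, leaving the socket stuck in LAST_ACK.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_tcp_state { TCP_ESTABLISHED, TCP_CLOSE_WAIT, TCP_LAST_ACK, TCP_CLOSE };

struct toy_sock {
	enum toy_tcp_state state;
	bool dead;	/* stand-in for the SOCK_DEAD flag */
};

/* tcp_close() on a CLOSE_WAIT socket: send FIN, enter LAST_ACK,
 * then sock_orphan() marks the sock SOCK_DEAD */
static void toy_tcp_close(struct toy_sock *sk)
{
	sk->state = TCP_LAST_ACK;
	sk->dead = true;
}

/* tcp_abort(): mirrors the branch quoted above -- the teardown only
 * runs for sockets that are NOT already SOCK_DEAD */
static void toy_tcp_abort(struct toy_sock *sk)
{
	if (!sk->dead)
		sk->state = TCP_CLOSE;	/* tcp_done() would run here */
	/* else: skipped, and the LAST_ACK socket lingers */
}
```

Running toy_tcp_close() followed by toy_tcp_abort() leaves the socket in TCP_LAST_ACK, which is exactly the leak the netstat output above shows.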

[PATCH net-next 0/4] qed* enhancements series

2018-11-22 Thread Sudarsana Reddy Kalluru
From: Sudarsana Reddy Kalluru 

This patch series adds a few enhancements to the qed/qede drivers.
Please consider applying it to "net-next".

Sudarsana Reddy Kalluru (4):
  qed: Display port_id in the UFP debug messages.
  qede: Simplify the usage of qede-flags.
  qede: Update link status only when interface is ready.
  qed: Add support for MBI upgrade over MFW.

 drivers/net/ethernet/qlogic/qed/qed_hsi.h|  6 +++
 drivers/net/ethernet/qlogic/qed/qed_main.c   | 13 +-
 drivers/net/ethernet/qlogic/qed/qed_mcp.c| 65 +++-
 drivers/net/ethernet/qlogic/qed/qed_mcp.h| 10 -
 drivers/net/ethernet/qlogic/qede/qede.h  | 12 +++--
 drivers/net/ethernet/qlogic/qede/qede_main.c | 10 +++--
 drivers/net/ethernet/qlogic/qede/qede_ptp.c  |  6 +--
 7 files changed, 71 insertions(+), 51 deletions(-)

-- 
1.8.3.1



[PATCH net-next 4/4] qed: Add support for MBI upgrade over MFW.

2018-11-22 Thread Sudarsana Reddy Kalluru
The patch adds driver support for MBI image update through MFW.

Signed-off-by: Sudarsana Reddy Kalluru 
Signed-off-by: Ariel Elior 
Signed-off-by: Michal Kalderon 
---
 drivers/net/ethernet/qlogic/qed/qed_hsi.h  |  6 
 drivers/net/ethernet/qlogic/qed/qed_main.c | 13 +++--
 drivers/net/ethernet/qlogic/qed/qed_mcp.c  | 45 +++---
 drivers/net/ethernet/qlogic/qed/qed_mcp.h  | 10 ---
 4 files changed, 40 insertions(+), 34 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_hsi.h 
b/drivers/net/ethernet/qlogic/qed/qed_hsi.h
index 5c221eb..7e120b5 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_hsi.h
+++ b/drivers/net/ethernet/qlogic/qed/qed_hsi.h
@@ -12655,6 +12655,7 @@ struct public_drv_mb {
 #define DRV_MB_PARAM_DCBX_NOTIFY_MASK  0x00FF
 #define DRV_MB_PARAM_DCBX_NOTIFY_SHIFT 3
 
+#define DRV_MB_PARAM_NVM_PUT_FILE_BEGIN_MBI 0x3
 #define DRV_MB_PARAM_NVM_LEN_OFFSET24
 
 #define DRV_MB_PARAM_CFG_VF_MSIX_VF_ID_SHIFT   0
@@ -12814,6 +12815,11 @@ struct public_drv_mb {
union drv_union_data union_data;
 };
 
+#define FW_MB_PARAM_NVM_PUT_FILE_REQ_OFFSET_MASK   0x00ff
+#define FW_MB_PARAM_NVM_PUT_FILE_REQ_OFFSET_SHIFT  0
+#define FW_MB_PARAM_NVM_PUT_FILE_REQ_SIZE_MASK 0xff00
+#define FW_MB_PARAM_NVM_PUT_FILE_REQ_SIZE_SHIFT24
+
 enum MFW_DRV_MSG_TYPE {
MFW_DRV_MSG_LINK_CHANGE,
MFW_DRV_MSG_FLR_FW_ACK_FAILED,
diff --git a/drivers/net/ethernet/qlogic/qed/qed_main.c 
b/drivers/net/ethernet/qlogic/qed/qed_main.c
index fff7f04..4b3e682 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_main.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_main.c
@@ -1939,21 +1939,30 @@ static int qed_nvm_flash_image_access(struct qed_dev 
*cdev, const u8 **data,
  * 0B  |   0x3 [command index]|
  * 4B  | b'0: check_response?   | b'1-31  reserved|
  * 8B  | File-type |   reserved   |
+ * 12B |Image length in bytes |
  * \--/
  * Start a new file of the provided type
  */
 static int qed_nvm_flash_image_file_start(struct qed_dev *cdev,
  const u8 **data, bool *check_resp)
 {
+   u32 file_type, file_size = 0;
int rc;
 
*data += 4;
*check_resp = !!(**data & BIT(0));
*data += 4;
+   file_type = **data;
 
DP_VERBOSE(cdev, NETIF_MSG_DRV,
-  "About to start a new file of type %02x\n", **data);
-   rc = qed_mcp_nvm_put_file_begin(cdev, **data);
+  "About to start a new file of type %02x\n", file_type);
+   if (file_type == DRV_MB_PARAM_NVM_PUT_FILE_BEGIN_MBI) {
+   *data += 4;
+   file_size = *((u32 *)(*data));
+   }
+
+   rc = qed_mcp_nvm_write(cdev, QED_PUT_FILE_BEGIN, file_type,
+  (u8 *)(&file_size), 4);
*data += 4;
 
return rc;
diff --git a/drivers/net/ethernet/qlogic/qed/qed_mcp.c 
b/drivers/net/ethernet/qlogic/qed/qed_mcp.c
index 34ed757..e7f18e3 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_mcp.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_mcp.c
@@ -2745,24 +2745,6 @@ int qed_mcp_nvm_resp(struct qed_dev *cdev, u8 *p_buf)
return 0;
 }
 
-int qed_mcp_nvm_put_file_begin(struct qed_dev *cdev, u32 addr)
-{
-   struct qed_hwfn *p_hwfn = QED_LEADING_HWFN(cdev);
-   struct qed_ptt *p_ptt;
-   u32 resp, param;
-   int rc;
-
-   p_ptt = qed_ptt_acquire(p_hwfn);
-   if (!p_ptt)
-   return -EBUSY;
-   rc = qed_mcp_cmd(p_hwfn, p_ptt, DRV_MSG_CODE_NVM_PUT_FILE_BEGIN, addr,
-&resp, &param);
-   cdev->mcp_nvm_resp = resp;
-   qed_ptt_release(p_hwfn, p_ptt);
-
-   return rc;
-}
-
 int qed_mcp_nvm_write(struct qed_dev *cdev,
  u32 cmd, u32 addr, u8 *p_buf, u32 len)
 {
@@ -2776,6 +2758,9 @@ int qed_mcp_nvm_write(struct qed_dev *cdev,
return -EBUSY;
 
switch (cmd) {
+   case QED_PUT_FILE_BEGIN:
+   nvm_cmd = DRV_MSG_CODE_NVM_PUT_FILE_BEGIN;
+   break;
case QED_PUT_FILE_DATA:
nvm_cmd = DRV_MSG_CODE_NVM_PUT_FILE_DATA;
break;
@@ -2788,10 +2773,14 @@ int qed_mcp_nvm_write(struct qed_dev *cdev,
goto out;
}
 
+   buf_size = min_t(u32, (len - buf_idx), MCP_DRV_NVM_BUF_LEN);
while (buf_idx < len) {
-   buf_size = min_t(u32, (len - buf_idx), MCP_DRV_NVM_BUF_LEN);
-   nvm_offset = ((buf_size << DRV_MB_PARAM_NVM_LEN_OFFSET) |
- addr) + buf_idx;
+   if (cmd == QED_PUT_FILE_BEGIN)
+   nvm_offset = addr;
+   else
+   nvm_offset = ((buf_size <<
+  

[PATCH net-next 3/4] qede: Update link status only when interface is ready.

2018-11-22 Thread Sudarsana Reddy Kalluru
In the case of an internal reload (e.g., an MTU change), there could be a race
between the link-up notification from the MFW and the driver unload processing.
In such a case the kernel assumes the link is up and starts using the queues,
which leads to a server crash.

Send the link notification to the kernel only when the driver has already
requested the link from the MFW.

Signed-off-by: Sudarsana Reddy Kalluru 
Signed-off-by: Ariel Elior 
Signed-off-by: Michal Kalderon 
---
 drivers/net/ethernet/qlogic/qede/qede.h  | 1 +
 drivers/net/ethernet/qlogic/qede/qede_main.c | 8 ++--
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qede/qede.h 
b/drivers/net/ethernet/qlogic/qede/qede.h
index f8ced12..8c0fe59 100644
--- a/drivers/net/ethernet/qlogic/qede/qede.h
+++ b/drivers/net/ethernet/qlogic/qede/qede.h
@@ -170,6 +170,7 @@ struct qede_rdma_dev {
 
 enum qede_flags_bit {
QEDE_FLAGS_IS_VF = 0,
+   QEDE_FLAGS_LINK_REQUESTED,
QEDE_FLAGS_PTP_TX_IN_PRORGESS,
QEDE_FLAGS_TX_TIMESTAMPING_EN
 };
diff --git a/drivers/net/ethernet/qlogic/qede/qede_main.c 
b/drivers/net/ethernet/qlogic/qede/qede_main.c
index 0f1c480..efbb4f3 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_main.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_main.c
@@ -2057,6 +2057,8 @@ static void qede_unload(struct qede_dev *edev, enum 
qede_unload_mode mode,
if (!is_locked)
__qede_lock(edev);
 
+   clear_bit(QEDE_FLAGS_LINK_REQUESTED, &edev->flags);
+
edev->state = QEDE_STATE_CLOSED;
 
qede_rdma_dev_event_close(edev);
@@ -2163,6 +2165,8 @@ static int qede_load(struct qede_dev *edev, enum 
qede_load_mode mode,
/* Program un-configured VLANs */
qede_configure_vlan_filters(edev);
 
+   set_bit(QEDE_FLAGS_LINK_REQUESTED, &edev->flags);
+
/* Ask for link-up using current configuration */
memset(&link_params, 0, sizeof(link_params));
link_params.link_up = true;
@@ -2258,8 +2262,8 @@ static void qede_link_update(void *dev, struct 
qed_link_output *link)
 {
struct qede_dev *edev = dev;
 
-   if (!netif_running(edev->ndev)) {
-   DP_VERBOSE(edev, NETIF_MSG_LINK, "Interface is not running\n");
+   if (!test_bit(QEDE_FLAGS_LINK_REQUESTED, &edev->flags)) {
+   DP_VERBOSE(edev, NETIF_MSG_LINK, "Interface is not ready\n");
return;
}
 
-- 
1.8.3.1



[PATCH net-next 2/4] qede: Simplify the usage of qede-flags.

2018-11-22 Thread Sudarsana Reddy Kalluru
The values in qede->flags are used in mixed ways:
  1. As a plain 'value' in some places, e.g., the QEDE_FLAGS_IS_VF usage.
  2. As a bit number in other places, e.g., the QEDE_FLAGS_PTP_TX_IN_PRORGESS
 usage.
This implementation poses problems when we want to add more flag values in the
future, e.g., overlap of the values or overflow of the 64-bit storage.

Update the implementation to use approach (2) consistently for qede->flags.
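The difference between the two styles is easy to demonstrate in a few lines of userspace C. This is an illustration only: the *_ul helpers below are invented stand-ins for the kernel's set_bit()/clear_bit()/test_bit(), which are additionally atomic.

```c
#include <assert.h>

/* Userspace stand-ins for the kernel bit helpers (illustration only) */
static void set_bit_ul(int nr, unsigned long *addr)   { *addr |=  (1UL << nr); }
static void clear_bit_ul(int nr, unsigned long *addr) { *addr &= ~(1UL << nr); }
static int  test_bit_ul(int nr, const unsigned long *addr)
{
	return !!(*addr & (1UL << nr));
}

/* Approach (2): every flag names a bit *number*, never a pre-shifted mask,
 * so new flags cannot silently collide or overflow the storage word. */
enum qede_flags_bit {
	QEDE_FLAGS_IS_VF = 0,
	QEDE_FLAGS_PTP_TX_IN_PRORGESS,	/* spelling kept from the driver */
	QEDE_FLAGS_TX_TIMESTAMPING_EN,
};
```

Mixing the styles (passing a mask like BIT(1) where a bit number is expected, or vice versa) is precisely the kind of bug approach (2) rules out.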

Signed-off-by: Sudarsana Reddy Kalluru 
Signed-off-by: Ariel Elior 
Signed-off-by: Michal Kalderon 
---
 drivers/net/ethernet/qlogic/qede/qede.h  | 11 +++
 drivers/net/ethernet/qlogic/qede/qede_main.c |  2 +-
 drivers/net/ethernet/qlogic/qede/qede_ptp.c  |  6 +++---
 3 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qede/qede.h 
b/drivers/net/ethernet/qlogic/qede/qede.h
index de98a97..f8ced12 100644
--- a/drivers/net/ethernet/qlogic/qede/qede.h
+++ b/drivers/net/ethernet/qlogic/qede/qede.h
@@ -168,6 +168,12 @@ struct qede_rdma_dev {
 
 #define QEDE_RFS_MAX_FLTR  256
 
+enum qede_flags_bit {
+   QEDE_FLAGS_IS_VF = 0,
+   QEDE_FLAGS_PTP_TX_IN_PRORGESS,
+   QEDE_FLAGS_TX_TIMESTAMPING_EN
+};
+
 struct qede_dev {
struct qed_dev  *cdev;
struct net_device   *ndev;
@@ -177,10 +183,7 @@ struct qede_dev {
u8  dp_level;
 
unsigned long flags;
-#define QEDE_FLAG_IS_VFBIT(0)
-#define IS_VF(edev)(!!((edev)->flags & QEDE_FLAG_IS_VF))
-#define QEDE_TX_TIMESTAMPING_ENBIT(1)
-#define QEDE_FLAGS_PTP_TX_IN_PRORGESS  BIT(2)
+#define IS_VF(edev)(test_bit(QEDE_FLAGS_IS_VF, &(edev)->flags))
 
const struct qed_eth_ops*ops;
struct qede_ptp *ptp;
diff --git a/drivers/net/ethernet/qlogic/qede/qede_main.c 
b/drivers/net/ethernet/qlogic/qede/qede_main.c
index 46d0f2e..0f1c480 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_main.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_main.c
@@ -1086,7 +1086,7 @@ static int __qede_probe(struct pci_dev *pdev, u32 
dp_module, u8 dp_level,
}
 
if (is_vf)
-   edev->flags |= QEDE_FLAG_IS_VF;
+   set_bit(QEDE_FLAGS_IS_VF, &edev->flags);
 
qede_init_ndev(edev);
 
diff --git a/drivers/net/ethernet/qlogic/qede/qede_ptp.c 
b/drivers/net/ethernet/qlogic/qede/qede_ptp.c
index 013ff56..5f3f42a 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_ptp.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_ptp.c
@@ -223,12 +223,12 @@ static int qede_ptp_cfg_filters(struct qede_dev *edev)
 
switch (ptp->tx_type) {
case HWTSTAMP_TX_ON:
-   edev->flags |= QEDE_TX_TIMESTAMPING_EN;
+   set_bit(QEDE_FLAGS_TX_TIMESTAMPING_EN, &edev->flags);
tx_type = QED_PTP_HWTSTAMP_TX_ON;
break;
 
case HWTSTAMP_TX_OFF:
-   edev->flags &= ~QEDE_TX_TIMESTAMPING_EN;
+   clear_bit(QEDE_FLAGS_TX_TIMESTAMPING_EN, &edev->flags);
tx_type = QED_PTP_HWTSTAMP_TX_OFF;
break;
 
@@ -518,7 +518,7 @@ void qede_ptp_tx_ts(struct qede_dev *edev, struct sk_buff 
*skb)
if (test_and_set_bit_lock(QEDE_FLAGS_PTP_TX_IN_PRORGESS, &edev->flags))
return;
 
-   if (unlikely(!(edev->flags & QEDE_TX_TIMESTAMPING_EN))) {
+   if (unlikely(!test_bit(QEDE_FLAGS_TX_TIMESTAMPING_EN, &edev->flags))) {
DP_NOTICE(edev,
  "Tx timestamping was not enabled, this packet will 
not be timestamped\n");
} else if (unlikely(ptp->tx_skb)) {
-- 
1.8.3.1



[PATCH net-next 1/4] qed: Display port_id in the UFP debug messages.

2018-11-22 Thread Sudarsana Reddy Kalluru
MFW sends UFP notifications mostly during the device init phase, and PFs
might not have been assigned a name by this time. Hence, capturing the
port-id in the debug messages helps in finding which PF the UFP notification
was sent to.

Also, fixed a minor semantic issue in a debug print.

Signed-off-by: Sudarsana Reddy Kalluru 
Signed-off-by: Ariel Elior 
Signed-off-by: Michal Kalderon 
---
 drivers/net/ethernet/qlogic/qed/qed_mcp.c | 20 +---
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_mcp.c 
b/drivers/net/ethernet/qlogic/qed/qed_mcp.c
index a96364d..34ed757 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_mcp.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_mcp.c
@@ -1619,7 +1619,7 @@ static void qed_mcp_update_stag(struct qed_hwfn *p_hwfn, 
struct qed_ptt *p_ptt)
qed_sp_pf_update_stag(p_hwfn);
}
 
-   DP_VERBOSE(p_hwfn, QED_MSG_SP, "ovlan  = %d hw_mode = 0x%x\n",
+   DP_VERBOSE(p_hwfn, QED_MSG_SP, "ovlan = %d hw_mode = 0x%x\n",
   p_hwfn->mcp_info->func_info.ovlan, p_hwfn->hw_info.hw_mode);
 
/* Acknowledge the MFW */
@@ -1641,7 +1641,9 @@ void qed_mcp_read_ufp_config(struct qed_hwfn *p_hwfn, 
struct qed_ptt *p_ptt)
val = (port_cfg & OEM_CFG_CHANNEL_TYPE_MASK) >>
OEM_CFG_CHANNEL_TYPE_OFFSET;
if (val != OEM_CFG_CHANNEL_TYPE_STAGGED)
-   DP_NOTICE(p_hwfn, "Incorrect UFP Channel type  %d\n", val);
+   DP_NOTICE(p_hwfn,
+ "Incorrect UFP Channel type  %d port_id 0x%02x\n",
+ val, MFW_PORT(p_hwfn));
 
val = (port_cfg & OEM_CFG_SCHED_TYPE_MASK) >> OEM_CFG_SCHED_TYPE_OFFSET;
if (val == OEM_CFG_SCHED_TYPE_ETS) {
@@ -1650,7 +1652,9 @@ void qed_mcp_read_ufp_config(struct qed_hwfn *p_hwfn, 
struct qed_ptt *p_ptt)
p_hwfn->ufp_info.mode = QED_UFP_MODE_VNIC_BW;
} else {
p_hwfn->ufp_info.mode = QED_UFP_MODE_UNKNOWN;
-   DP_NOTICE(p_hwfn, "Unknown UFP scheduling mode %d\n", val);
+   DP_NOTICE(p_hwfn,
+ "Unknown UFP scheduling mode %d port_id 0x%02x\n",
+ val, MFW_PORT(p_hwfn));
}
 
qed_mcp_get_shmem_func(p_hwfn, p_ptt, &shmem_info, MCP_PF_ID(p_hwfn));
@@ -1665,13 +1669,15 @@ void qed_mcp_read_ufp_config(struct qed_hwfn *p_hwfn, 
struct qed_ptt *p_ptt)
p_hwfn->ufp_info.pri_type = QED_UFP_PRI_OS;
} else {
p_hwfn->ufp_info.pri_type = QED_UFP_PRI_UNKNOWN;
-   DP_NOTICE(p_hwfn, "Unknown Host priority control %d\n", val);
+   DP_NOTICE(p_hwfn,
+ "Unknown Host priority control %d port_id 0x%02x\n",
+ val, MFW_PORT(p_hwfn));
}
 
DP_NOTICE(p_hwfn,
- "UFP shmem config: mode = %d tc = %d pri_type = %d\n",
- p_hwfn->ufp_info.mode,
- p_hwfn->ufp_info.tc, p_hwfn->ufp_info.pri_type);
+ "UFP shmem config: mode = %d tc = %d pri_type = %d port_id 
0x%02x\n",
+ p_hwfn->ufp_info.mode, p_hwfn->ufp_info.tc,
+ p_hwfn->ufp_info.pri_type, MFW_PORT(p_hwfn));
 }
 
 static int
-- 
1.8.3.1



Reply: [PATCH][net-next] vhost:net: allocate 32KB memory instead of 32K pages when page frag refill

2018-11-22 Thread Li,Rongqing

On 2018/11/23 10:04 AM, Li RongQing wrote:
> > when the page frag refills, 32K pages (128MB of memory) are requested,
> > which hardly succeeds when the system is under memory stress


> Looking at get_order(), it seems we get 3 after get_order(32768), since
> it accepts the size of the block in bytes.

You are right, I misunderstood.

Please drop this patch, sorry for the noise

-Q


Reply: [PATCH] net: fix the per task frag allocator size

2018-11-22 Thread Li,Rongqing

> get_order(8) returns zero here if I understood it correctly.


You are right, I misunderstood.

Please drop this patch, sorry for the noise

-Q






Re: [PATCH][net-next] vhost:net: allocate 32KB memory instead of 32K pages when page frag refill

2018-11-22 Thread Jason Wang



On 2018/11/23 10:04 AM, Li RongQing wrote:

when the page frag refills, 32K pages (128MB of memory) are requested,
which hardly succeeds when the system is under memory stress



Looking at get_order(), it seems we get 3 after get_order(32768), since
it accepts the size of the block in bytes.


/**
 * get_order - Determine the allocation order of a memory size
 * @size: The size for which to get the order
...

#define get_order(n)						\
(								\
	__builtin_constant_p(n) ? (				\
		((n) == 0UL) ? BITS_PER_LONG - PAGE_SHIFT :	\
		(((n) < (1UL << PAGE_SHIFT)) ? 0 :		\
		 ilog2((n) - 1) - PAGE_SHIFT + 1)		\
		 ^^^
	) :							\
	__get_order(n)						\
)
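For readers following the arithmetic: a userspace re-implementation of the non-constant path (assuming 4 KiB pages, i.e. PAGE_SHIFT = 12, as on x86; toy_get_order is an invented name) confirms the values discussed in this thread and the per-task-frag one: get_order(32768) is 3, while get_order(8) is 0.

```c
#include <assert.h>

#define TOY_PAGE_SHIFT 12	/* assume 4 KiB pages, as on x86 */

/* userspace re-implementation of the kernel's __get_order():
 * smallest order such that (1 << order) pages cover 'size' bytes */
static int toy_get_order(unsigned long size)
{
	int order = 0;

	size = (size - 1) >> TOY_PAGE_SHIFT;
	while (size) {
		order++;
		size >>= 1;
	}
	return order;
}
```

So get_order(32768) allocates 2^3 = 8 pages (32 KiB), not 32K pages; the macro takes a byte count, which is the misunderstanding behind both dropped patches.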




And such a large memory size will cause an underflow of the reference
bias and corrupt the page refcount, since the reference bias will
be decremented to negative before the allocated memory is used up



Do you have reproducer for this issue?

Thanks




so 32KB of memory is a safe choice; meanwhile, remove an unnecessary
check

Fixes: e4dab1e6ea64 ("vhost_net: mitigate page reference counting during page frag refill")
Signed-off-by: Zhang Yu 
Signed-off-by: Li RongQing 
---
  drivers/vhost/net.c | 22 +++---
  1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index d919284f103b..b933a4a8e4ba 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -641,7 +641,7 @@ static bool tx_can_batch(struct vhost_virtqueue *vq, size_t 
total_len)
   !vhost_vq_avail_empty(vq->dev, vq);
  }
  
-#define SKB_FRAG_PAGE_ORDER get_order(32768)

+#define SKB_FRAG_PAGE_ORDER3
  
  static bool vhost_net_page_frag_refill(struct vhost_net *net, unsigned int sz,

   struct page_frag *pfrag, gfp_t gfp)
@@ -654,17 +654,17 @@ static bool vhost_net_page_frag_refill(struct vhost_net 
*net, unsigned int sz,
  
  	pfrag->offset = 0;

net->refcnt_bias = 0;
-   if (SKB_FRAG_PAGE_ORDER) {
-   /* Avoid direct reclaim but allow kswapd to wake */
-   pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
- __GFP_COMP | __GFP_NOWARN |
- __GFP_NORETRY,
- SKB_FRAG_PAGE_ORDER);
-   if (likely(pfrag->page)) {
-   pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER;
-   goto done;
-   }
+
+   /* Avoid direct reclaim but allow kswapd to wake */
+   pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
+ __GFP_COMP | __GFP_NOWARN |
+ __GFP_NORETRY,
+ SKB_FRAG_PAGE_ORDER);
+   if (likely(pfrag->page)) {
+   pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER;
+   goto done;
}
+
pfrag->page = alloc_page(gfp);
if (likely(pfrag->page)) {
pfrag->size = PAGE_SIZE;


Re: [PATCH] net: fix the per task frag allocator size

2018-11-22 Thread Yunsheng Lin
On 2018/11/23 10:03, Li RongQing wrote:
> when filling the task frag, 32K pages (128MB of memory) are requested,
> which hardly succeeds when the system is under memory stress
> 
> and commit '5640f7685831 ("net: use a per task frag allocator")'
> said it wants 32768 bytes, not 32768 pages:
> 
>"(up to 32768 bytes per frag, thats order-3 pages on x86)"
> 
> Fixes: 5640f7685831e ("net: use a per task frag allocator")
> Signed-off-by: Zhang Yu 
> Signed-off-by: Li RongQing 
> ---
>  net/core/sock.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 6d7e189e3cd9..e3cbefeedf5c 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -2186,8 +2186,8 @@ static void sk_leave_memory_pressure(struct sock *sk)
>   }
>  }
>  
> -/* On 32bit arches, an skb frag is limited to 2^15 */
> -#define SKB_FRAG_PAGE_ORDER  get_order(32768)
> +/* On 32bit arches, an skb frag is limited to 2^15 bytes*/
> +#define SKB_FRAG_PAGE_ORDER  get_order(8)

get_order(8) returns zero here if I understood it correctly.


>  
>  /**
>   * skb_page_frag_refill - check that a page_frag contains enough room
> 



[PATCH][net-next] vhost:net: allocate 32KB memory instead of 32K pages when page frag refill

2018-11-22 Thread Li RongQing
when the page frag refills, 32K pages (128MB of memory) are requested,
which hardly succeeds when the system is under memory stress

And such a large memory size will cause an underflow of the reference
bias and corrupt the page refcount, since the reference bias will
be decremented to negative before the allocated memory is used up

so 32KB of memory is a safe choice; meanwhile, remove an unnecessary
check

Fixes: e4dab1e6ea64 ("vhost_net: mitigate page reference counting during page frag refill")
Signed-off-by: Zhang Yu 
Signed-off-by: Li RongQing 
---
 drivers/vhost/net.c | 22 +++---
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index d919284f103b..b933a4a8e4ba 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -641,7 +641,7 @@ static bool tx_can_batch(struct vhost_virtqueue *vq, size_t 
total_len)
   !vhost_vq_avail_empty(vq->dev, vq);
 }
 
-#define SKB_FRAG_PAGE_ORDER get_order(32768)
+#define SKB_FRAG_PAGE_ORDER3
 
 static bool vhost_net_page_frag_refill(struct vhost_net *net, unsigned int sz,
   struct page_frag *pfrag, gfp_t gfp)
@@ -654,17 +654,17 @@ static bool vhost_net_page_frag_refill(struct vhost_net 
*net, unsigned int sz,
 
pfrag->offset = 0;
net->refcnt_bias = 0;
-   if (SKB_FRAG_PAGE_ORDER) {
-   /* Avoid direct reclaim but allow kswapd to wake */
-   pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
- __GFP_COMP | __GFP_NOWARN |
- __GFP_NORETRY,
- SKB_FRAG_PAGE_ORDER);
-   if (likely(pfrag->page)) {
-   pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER;
-   goto done;
-   }
+
+   /* Avoid direct reclaim but allow kswapd to wake */
+   pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
+ __GFP_COMP | __GFP_NOWARN |
+ __GFP_NORETRY,
+ SKB_FRAG_PAGE_ORDER);
+   if (likely(pfrag->page)) {
+   pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER;
+   goto done;
}
+
pfrag->page = alloc_page(gfp);
if (likely(pfrag->page)) {
pfrag->size = PAGE_SIZE;
-- 
2.16.2



[PATCH] net: fix the per task frag allocator size

2018-11-22 Thread Li RongQing
when filling the task frag, 32K pages (128MB of memory) are requested,
which hardly succeeds when the system is under memory stress

and commit 5640f7685831 ("net: use a per task frag allocator")
said it wants 32768 bytes, not 32768 pages:

 "(up to 32768 bytes per frag, thats order-3 pages on x86)"

Fixes: 5640f7685831e ("net: use a per task frag allocator")
Signed-off-by: Zhang Yu 
Signed-off-by: Li RongQing 
---
 net/core/sock.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index 6d7e189e3cd9..e3cbefeedf5c 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2186,8 +2186,8 @@ static void sk_leave_memory_pressure(struct sock *sk)
}
 }
 
-/* On 32bit arches, an skb frag is limited to 2^15 */
-#define SKB_FRAG_PAGE_ORDERget_order(32768)
+/* On 32bit arches, an skb frag is limited to 2^15 bytes*/
+#define SKB_FRAG_PAGE_ORDERget_order(8)
 
 /**
  * skb_page_frag_refill - check that a page_frag contains enough room
-- 
2.16.2



Re: [Patch net-next 2/2] net: dump whole skb data in netdev_rx_csum_fault()

2018-11-22 Thread Cong Wang
On Wed, Nov 21, 2018 at 11:33 AM Saeed Mahameed  wrote:
>
> On Wed, 2018-11-21 at 10:26 -0800, Eric Dumazet wrote:
> > On Wed, Nov 21, 2018 at 10:17 AM Cong Wang 
> > wrote:
> > > On Wed, Nov 21, 2018 at 5:05 AM Eric Dumazet <
> > > eric.duma...@gmail.com> wrote:
> > > >
> > > >
> > > > On 11/20/2018 06:13 PM, Cong Wang wrote:
> > > > > Currently, we only dump a few selected skb fields in
> > > > > netdev_rx_csum_fault(). It is not sufficient for debugging
> > > > > checksum
> > > > > fault. This patch introduces skb_dump() which dumps skb mac
> > > > > header,
> > > > > network header and its whole skb->data too.
> > > > >
> > > > > Cc: Herbert Xu 
> > > > > Cc: Eric Dumazet 
> > > > > Cc: David Miller 
> > > > > Signed-off-by: Cong Wang 
> > > > > ---
> > > > > + print_hex_dump(level, "skb data: ", DUMP_PREFIX_OFFSET,
> > > > > 16, 1,
> > > > > +skb->data, skb->len, false);
> > > >
> > > > As I mentioned to David, we want all the bytes that were maybe
> > > > already pulled
> > > >
> > > > (skb->head starting point, not skb->data)
> > >
> > > Hmm, with mac header and network header, it is effectively from
> > > skb->head, no?
> > > Is there anything between skb->head and mac header?
> >
> > Oh, I guess we wanted a single hex dump, or we need some user program
> > to be able to
> > rebuild from different memory zones the original CHECKSUM_COMPLETE
> > value.
> >
>
> Normally the driver keeps some headroom @skb->head, so the actual mac
> header starts @ skb->head + driver_specific_headroom

Good to know, but this headroom isn't covered by skb->csum, so it's
not useful here, right? The skb->csum for mlx5 only covers the network
header and its payload.


Re: [Patch net-next 2/2] net: dump whole skb data in netdev_rx_csum_fault()

2018-11-22 Thread Cong Wang
On Wed, Nov 21, 2018 at 10:26 AM Eric Dumazet  wrote:
>
> On Wed, Nov 21, 2018 at 10:17 AM Cong Wang  wrote:
> >
> > On Wed, Nov 21, 2018 at 5:05 AM Eric Dumazet  wrote:
> > >
> > >
> > >
> > > On 11/20/2018 06:13 PM, Cong Wang wrote:
> > > > Currently, we only dump a few selected skb fields in
> > > > netdev_rx_csum_fault(). It is not sufficient for debugging checksum
> > > > fault. This patch introduces skb_dump() which dumps skb mac header,
> > > > network header and its whole skb->data too.
> > > >
> > > > Cc: Herbert Xu 
> > > > Cc: Eric Dumazet 
> > > > Cc: David Miller 
> > > > Signed-off-by: Cong Wang 
> > > > ---
> > >
> > >
> > > > + print_hex_dump(level, "skb data: ", DUMP_PREFIX_OFFSET, 16, 1,
> > > > +skb->data, skb->len, false);
> > >
> > > As I mentioned to David, we want all the bytes that were maybe already 
> > > pulled
> > >
> > > (skb->head starting point, not skb->data)
> >
> > Hmm, with mac header and network header, it is effectively from skb->head, 
> > no?
> > Is there anything between skb->head and mac header?
>
> Oh, I guess we wanted a single hex dump, or we need some user program
> to be able to
> rebuild from different memory zones the original CHECKSUM_COMPLETE value.


Yeah, I can remove the prefix and dump the complete packet as
one single block. This means I also need to check where
skb->data points to.

>
> >
> > >
> > > Also we will miss the trimmed bytes if there were padding data.
> > > And it seems the various bugs we have are all tied to the pulled or 
> > > trimmed bytes.
> > >
> >
> > Unless I miss something, the tailing padding data should be in range
> > [iphdr->tot_len, skb->len]. No?
>
>
> Not after we did the pskb_trim_rcsum() call, since it has effectively
> reduced skb->len by the number of padding bytes.

Sure, this patch can't change where netdev_rx_csum_fault() gets
called. We either need to move the checksum validation earlier,
or move the trimming later; neither of those belongs in this patch.

Thanks.
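Background on why the pulled/trimmed bytes matter in this thread: CHECKSUM_COMPLETE is a 16-bit ones'-complement sum over every byte the NIC saw, and that sum is associative, so a dump covering all the memory zones lets a user program rebuild it piecewise. A minimal userspace sketch of the folding arithmetic (toy_csum is an invented name; this is not the kernel's optimized csum code, and real NICs sum 16-bit words with their own byte-order handling):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* 16-bit ones'-complement sum over a byte range, pairing bytes big-endian;
 * the carry-folding loop is what makes partial sums combinable */
static uint16_t toy_csum(const uint8_t *data, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += (uint32_t)((data[i] << 8) | data[i + 1]);
	if (len & 1)
		sum += (uint32_t)(data[len - 1] << 8);
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);	/* fold carries */
	return (uint16_t)sum;
}
```

Because the sum over the whole buffer equals the folded sum of the sums over its pieces (at even offsets), a checksum computed over [head, tail] can be reconciled with one computed over [data, tail] by subtracting the pulled bytes' contribution, which is why the dump needs to start at skb->head.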


Re: [PATCH net-next 00/12] switchdev: Convert switchdev_port_obj_{add,del}() to notifiers

2018-11-22 Thread Petr Machata
Petr Machata  writes:

> An offloading driver may need to have access to switchdev events on
> ports that aren't directly under its control. An example is a VXLAN port
> attached to a bridge offloaded by a driver. The driver needs to know
> about VLANs configured on the VXLAN device. However the VXLAN device
> isn't sandwiched between the bridge and a front-panel-port device (such as
> is the case e.g. for LAG devices), so the usual switchdev ops don't
> reach the driver.

mlxsw will use these notifications to offload VXLAN devices attached to
a VLAN-aware bridge. The patches are available here should anyone wish
to take a look:

https://github.com/idosch/linux/commits/vxlan

Thanks,
Petr


[PATCH bpf-next 0/3] bpf: add sk_msg helper sk_msg_pop_data

2018-11-22 Thread John Fastabend
After being able to add metadata to messages with sk_msg_push_data we
have also found it useful to be able to "pop" this metadata off before
sending it to applications in some cases. This series adds a new helper
sk_msg_pop_data() and the associated patches to add tests and tools/lib
support.
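The pop semantics are easy to picture on a linear buffer. Below is a hedged plain-C model of the helper's effect (toy_msg_pop_data is invented for illustration; the real helper operates on scatter-gather sk_msg data and can fail with ENOMEM, as the series describes):

```c
#include <assert.h>
#include <string.h>
#include <stddef.h>

/* Model of bpf_msg_pop_data(): remove 'pop' bytes at offset 'start' and
 * close the gap; returns the new length, or -1 if the range falls outside
 * the message (cf. the invalid start/pop errors in the helper doc). */
static long toy_msg_pop_data(unsigned char *buf, size_t len,
			     size_t start, size_t pop)
{
	if (start > len || pop > len - start)
		return -1;
	memmove(buf + start, buf + start + pop, len - start - pop);
	return (long)(len - pop);
}
```

The typical use matching this series: push a metadata header with msg_push_data for in-kernel consumers, then pop it off again before the payload reaches the application.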

Thanks!

John Fastabend (3):
  bpf: helper to pop data from messages
  bpf: add msg_pop_data helper to tools
  bpf: test_sockmap, add options for msg_pop_data() helper usage

 include/uapi/linux/bpf.h|  13 +-
 net/core/filter.c   | 169 
 net/ipv4/tcp_bpf.c  |  14 +-
 tools/include/uapi/linux/bpf.h  |  13 +-
 tools/testing/selftests/bpf/bpf_helpers.h   |   2 +
 tools/testing/selftests/bpf/test_sockmap.c  | 127 +-
 tools/testing/selftests/bpf/test_sockmap_kern.h |  70 --
 7 files changed, 386 insertions(+), 22 deletions(-)

-- 
2.7.4



[PATCH bpf-next 2/3] bpf: add msg_pop_data helper to tools

2018-11-22 Thread John Fastabend
Add the necessary header definitions to tools for new
msg_pop_data_helper.

Signed-off-by: John Fastabend 
---
 tools/include/uapi/linux/bpf.h| 13 -
 tools/testing/selftests/bpf/bpf_helpers.h |  2 ++
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index c1554aa..95cf7a5 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2268,6 +2268,16 @@ union bpf_attr {
  *
  * Return
  * 0 on success, or a negative error in case of failure.
+ *
+ * int bpf_msg_pop_data(struct sk_msg_buff *msg, u32 start, u32 pop, u64 flags)
+ * Description
+ * Will remove 'pop' bytes from a msg starting at byte 'start'.
+ * This may result in ENOMEM errors under certain situations, where
+ * an allocation and copy are required due to a full ring buffer.
+ * However, the helper will try to avoid doing the allocation
+ * if possible. Other errors can occur if the input parameters are
+ * invalid, either due to the start byte not being a valid part of
+ * the msg payload and/or the pop value being too large.
  */
 #define __BPF_FUNC_MAPPER(FN)  \
FN(unspec), \
@@ -2360,7 +2370,8 @@ union bpf_attr {
FN(map_push_elem),  \
FN(map_pop_elem),   \
FN(map_peek_elem),  \
-   FN(msg_push_data),
+   FN(msg_push_data),  \
+   FN(msg_pop_data),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
index 686e57c..7b69519 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -113,6 +113,8 @@ static int (*bpf_msg_pull_data)(void *ctx, int start, int end, int flags) =
(void *) BPF_FUNC_msg_pull_data;
 static int (*bpf_msg_push_data)(void *ctx, int start, int end, int flags) =
(void *) BPF_FUNC_msg_push_data;
+static int (*bpf_msg_pop_data)(void *ctx, int start, int cut, int flags) =
+   (void *) BPF_FUNC_msg_pop_data;
 static int (*bpf_bind)(void *ctx, void *addr, int addr_len) =
(void *) BPF_FUNC_bind;
 static int (*bpf_xdp_adjust_tail)(void *ctx, int offset) =
-- 
2.7.4



[PATCH bpf-next 1/3] bpf: helper to pop data from messages

2018-11-22 Thread John Fastabend
This adds a BPF SK_MSG program helper so that we can pop data from a
msg. We use this to pop metadata from a previous push data call.

Signed-off-by: John Fastabend 
---
 include/uapi/linux/bpf.h |  13 +++-
 net/core/filter.c| 169 +++
 net/ipv4/tcp_bpf.c   |  14 +++-
 3 files changed, 192 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index c1554aa..64681f8 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2268,6 +2268,16 @@ union bpf_attr {
  *
  * Return
  * 0 on success, or a negative error in case of failure.
+ *
+ * int bpf_msg_pop_data(struct sk_msg_buff *msg, u32 start, u32 pop, u64 flags)
+ *  Description
+ * Will remove 'pop' bytes from a msg starting at byte 'start'.
+ * This may result in ENOMEM errors under certain situations, where
+ * an allocation and copy are required due to a full ring buffer.
+ * However, the helper will try to avoid doing the allocation
+ * if possible. Other errors can occur if the input parameters are
+ * invalid, either due to the start byte not being a valid part of
+ * the msg payload and/or the pop value being too large.
  */
 #define __BPF_FUNC_MAPPER(FN)  \
FN(unspec), \
@@ -2360,7 +2370,8 @@ union bpf_attr {
FN(map_push_elem),  \
FN(map_pop_elem),   \
FN(map_peek_elem),  \
-   FN(msg_push_data),
+   FN(msg_push_data),  \
+   FN(msg_pop_data),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/net/core/filter.c b/net/core/filter.c
index f6ca38a..c6b35b5 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2428,6 +2428,173 @@ static const struct bpf_func_proto bpf_msg_push_data_proto = {
.arg4_type  = ARG_ANYTHING,
 };
 
+static void sk_msg_shift_left(struct sk_msg *msg, int i)
+{
+   int prev;
+
+   do {
+   prev = i;
+   sk_msg_iter_var_next(i);
+   msg->sg.data[prev] = msg->sg.data[i];
+   } while (i != msg->sg.end);
+
+   sk_msg_iter_prev(msg, end);
+}
+
+static void sk_msg_shift_right(struct sk_msg *msg, int i)
+{
+   struct scatterlist tmp, sge;
+
+   sk_msg_iter_next(msg, end);
+   sge = sk_msg_elem_cpy(msg, i);
+   sk_msg_iter_var_next(i);
+   tmp = sk_msg_elem_cpy(msg, i);
+
+   while (i != msg->sg.end) {
+   msg->sg.data[i] = sge;
+   sk_msg_iter_var_next(i);
+   sge = tmp;
+   tmp = sk_msg_elem_cpy(msg, i);
+   }
+}
+
+BPF_CALL_4(bpf_msg_pop_data, struct sk_msg *, msg, u32, start,
+  u32, len, u64, flags)
+{
+   u32 i = 0, l, space, offset = 0;
+   u64 last = start + len;
+   int pop;
+
+   if (unlikely(flags))
+   return -EINVAL;
+
+   /* First find the starting scatterlist element */
+   i = msg->sg.start;
+   do {
+   l = sk_msg_elem(msg, i)->length;
+
+   if (start < offset + l)
+   break;
+   offset += l;
+   sk_msg_iter_var_next(i);
+   } while (i != msg->sg.end);
+
+   /* Bounds checks: start and pop must be inside message */
+   if (start >= offset + l || last >= msg->sg.size)
+   return -EINVAL;
+
+   space = MAX_MSG_FRAGS - sk_msg_elem_used(msg);
+
+   pop = len;
+   /* --| offset
+* -| start  |--- len --|
+*
+*  |- a | pop ---|- b |
+*  |__| length
+*
+*
+* a:   region at front of scatter element to save
+* b:   region at back of scatter element to save when length > A + pop
+* pop: region to pop from element, same as input 'pop' here will be
+*  decremented below per iteration.
+*
+* Two top-level cases to handle when start != offset: first, B is
+* non-zero and second, B is zero, corresponding to when a pop
+* includes more than one element.
+*
+* Then if B is non-zero AND there is no space, allocate space and
+* compact the A and B regions into a page. If there is space, shift
+* the ring to the right, freeing the next element in the ring to
+* place B, leaving A untouched except to reduce its length.
+*/
+   if (start != offset) {
+   struct scatterlist *nsge, *sge = sk_msg_elem(msg, i);
+   int a = start;
+   int b = sge->length - pop - a;
+
+   sk_msg_iter_var_next(i);
+
+   if (pop < sge->length - a) {
+   if (space) {
+   sge->length = a;
+   sk_msg_shift_right(msg, i);
+  

[PATCH bpf-next 3/3] bpf: test_sockmap, add options for msg_pop_data()

2018-11-22 Thread John Fastabend
Similar to msg_pull_data and msg_push_data add a set of options to
have msg_pop_data() exercised.

Signed-off-by: John Fastabend 
---
 tools/testing/selftests/bpf/test_sockmap.c  | 127 +++-
 tools/testing/selftests/bpf/test_sockmap_kern.h |  70 ++---
 2 files changed, 180 insertions(+), 17 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_sockmap.c b/tools/testing/selftests/bpf/test_sockmap.c
index 622ade0..e85a771 100644
--- a/tools/testing/selftests/bpf/test_sockmap.c
+++ b/tools/testing/selftests/bpf/test_sockmap.c
@@ -79,6 +79,8 @@ int txmsg_start;
 int txmsg_end;
 int txmsg_start_push;
 int txmsg_end_push;
+int txmsg_start_pop;
+int txmsg_pop;
 int txmsg_ingress;
 int txmsg_skb;
 int ktls;
@@ -104,6 +106,8 @@ static const struct option long_options[] = {
{"txmsg_end",   required_argument,  NULL, 'e'},
{"txmsg_start_push", required_argument, NULL, 'p'},
{"txmsg_end_push",   required_argument, NULL, 'q'},
+   {"txmsg_start_pop",  required_argument, NULL, 'w'},
+   {"txmsg_pop",required_argument, NULL, 'x'},
	{"txmsg_ingress", no_argument,	&txmsg_ingress, 1 },
	{"txmsg_skb", no_argument,	&txmsg_skb, 1 },
	{"ktls", no_argument,		&ktls, 1 },
@@ -473,13 +477,27 @@ static int msg_loop(int fd, int iov_count, int iov_length, int cnt,
		clock_gettime(CLOCK_MONOTONIC, &s->end);
} else {
int slct, recvp = 0, recv, max_fd = fd;
+   float total_bytes, txmsg_pop_total;
int fd_flags = O_NONBLOCK;
struct timeval timeout;
-   float total_bytes;
fd_set w;
 
fcntl(fd, fd_flags);
+   /* Account for pop bytes noting each iteration of apply will
+* call msg_pop_data helper so we need to account for this
+* by calculating the number of apply iterations. Note the user
+* of the tool can create cases where no data is sent by
+* manipulating pop/push/pull/etc. For example txmsg_apply 1
+* with txmsg_pop 1 will try to apply 1B at a time but each
+* iteration will then pop 1B so no data will ever be sent.
+* This is really only useful for testing edge cases in code
+* paths.
+*/
total_bytes = (float)iov_count * (float)iov_length * (float)cnt;
+   txmsg_pop_total = txmsg_pop;
+   if (txmsg_apply)
+   txmsg_pop_total *= (total_bytes / txmsg_apply);
+   total_bytes -= txmsg_pop_total;
err = clock_gettime(CLOCK_MONOTONIC, >start);
if (err < 0)
perror("recv start time: ");
@@ -488,7 +506,7 @@ static int msg_loop(int fd, int iov_count, int iov_length, int cnt,
timeout.tv_sec = 0;
timeout.tv_usec = 30;
} else {
-   timeout.tv_sec = 1;
+   timeout.tv_sec = 3;
timeout.tv_usec = 0;
}
 
@@ -503,7 +521,7 @@ static int msg_loop(int fd, int iov_count, int iov_length, int cnt,
goto out_errno;
} else if (!slct) {
if (opt->verbose)
-   fprintf(stderr, "unexpected timeout\n");
+				fprintf(stderr, "unexpected timeout: recved %zu/%f pop_total %f\n", s->bytes_recvd, total_bytes, txmsg_pop_total);
errno = -EIO;
clock_gettime(CLOCK_MONOTONIC, >end);
goto out_errno;
@@ -619,7 +637,7 @@ static int sendmsg_test(struct sockmap_options *opt)
iov_count = 1;
err = msg_loop(rx_fd, iov_count, iov_buf,
			   cnt, &s, false, opt);
-   if (err && opt->verbose)
+   if (opt->verbose)
fprintf(stderr,
			"msg_loop_rx: iov_count %i iov_buf %i cnt %i err %i\n",
iov_count, iov_buf, cnt, err);
@@ -931,6 +949,39 @@ static int run_options(struct sockmap_options *options, 
int cg_fd,  int test)
}
}
 
+   if (txmsg_start_pop) {
+   i = 4;
+   err = bpf_map_update_elem(map_fd[5],
+					  &i, &txmsg_start_pop, BPF_ANY);
+   if (err) {
+   fprintf(stderr,
+				"ERROR: bpf_map_update_elem %i@%i (txmsg_start_pop): %d (%s)\n",
+				txmsg_start_pop, i, err, strerror(errno));
+

Re: [PATCH v2 bpf-next] bpf: add skb->tstamp r/w access from tc clsact and cg skb progs

2018-11-22 Thread Alexei Starovoitov
On Thu, Nov 22, 2018 at 02:39:16PM -0500, Vlad Dumitrescu wrote:
> This could be used to rate limit egress traffic in concert with a qdisc
> which supports Earliest Departure Time, such as FQ.
> 
> Write access from cg skb progs only with CAP_SYS_ADMIN, since the value
> will be used by downstream qdiscs. It might make sense to relax this.
> 
> Changes v1 -> v2:
>   - allow access from cg skb, write only with CAP_SYS_ADMIN
> 
> Signed-off-by: Vlad Dumitrescu 

Applied to bpf-next.
I copied Eric's and Willem's Acks from v1, since v2 is essentially the same.
Thanks everyone!



[PATCH net-next 12/12] rocker, dsa, ethsw: Don't filter VLAN events on bridge itself

2018-11-22 Thread Petr Machata
Due to an explicit check in rocker_world_port_obj_vlan_add(),
dsa_slave_switchdev_event() and port_switchdev_event(), respectively,
VLAN objects added to a device that is not a front-panel port device
are ignored. Therefore this check is immaterial.

Signed-off-by: Petr Machata 
Acked-by: Jiri Pirko 
---
 drivers/net/ethernet/rocker/rocker_main.c | 3 ---
 drivers/staging/fsl-dpaa2/ethsw/ethsw.c   | 3 ---
 net/dsa/port.c| 3 ---
 3 files changed, 9 deletions(-)

diff --git a/drivers/net/ethernet/rocker/rocker_main.c b/drivers/net/ethernet/rocker/rocker_main.c
index f05d5c1341b6..6213827e3956 100644
--- a/drivers/net/ethernet/rocker/rocker_main.c
+++ b/drivers/net/ethernet/rocker/rocker_main.c
@@ -1632,9 +1632,6 @@ rocker_world_port_obj_vlan_add(struct rocker_port *rocker_port,
 {
struct rocker_world_ops *wops = rocker_port->rocker->wops;
 
-   if (netif_is_bridge_master(vlan->obj.orig_dev))
-   return -EOPNOTSUPP;
-
if (!wops->port_obj_vlan_add)
return -EOPNOTSUPP;
 
diff --git a/drivers/staging/fsl-dpaa2/ethsw/ethsw.c b/drivers/staging/fsl-dpaa2/ethsw/ethsw.c
index 06a233c7cdd3..4fa37d6e598b 100644
--- a/drivers/staging/fsl-dpaa2/ethsw/ethsw.c
+++ b/drivers/staging/fsl-dpaa2/ethsw/ethsw.c
@@ -719,9 +719,6 @@ static int port_vlans_add(struct net_device *netdev,
struct ethsw_port_priv *port_priv = netdev_priv(netdev);
int vid, err = 0;
 
-   if (netif_is_bridge_master(vlan->obj.orig_dev))
-   return -EOPNOTSUPP;
-
if (switchdev_trans_ph_prepare(trans))
return 0;
 
diff --git a/net/dsa/port.c b/net/dsa/port.c
index ed0595459df1..2d7e01b23572 100644
--- a/net/dsa/port.c
+++ b/net/dsa/port.c
@@ -252,9 +252,6 @@ int dsa_port_vlan_add(struct dsa_port *dp,
.vlan = vlan,
};
 
-   if (netif_is_bridge_master(vlan->obj.orig_dev))
-   return -EOPNOTSUPP;
-
if (br_vlan_enabled(dp->bridge_dev))
		return dsa_port_notify(dp, DSA_NOTIFIER_VLAN_ADD, &info);
 
-- 
2.4.11



[PATCH net-next 11/12] switchdev: Replace port obj add/del SDO with a notification

2018-11-22 Thread Petr Machata
Drop switchdev_ops.switchdev_port_obj_add and _del. Drop the uses of
this field from all clients, which were migrated to use switchdev
notification in the previous patches.

Add a new function switchdev_port_obj_notify() that sends the switchdev
notifications SWITCHDEV_PORT_OBJ_ADD and _DEL.

Update switchdev_port_obj_del_now() to dispatch to this new function.
Drop __switchdev_port_obj_add() and update switchdev_port_obj_add()
likewise.

Signed-off-by: Petr Machata 
Reviewed-by: Ido Schimmel 
---
 .../ethernet/mellanox/mlxsw/spectrum_switchdev.c   |  2 -
 drivers/net/ethernet/mscc/ocelot.c |  2 -
 drivers/net/ethernet/rocker/rocker_main.c  |  2 -
 drivers/staging/fsl-dpaa2/ethsw/ethsw.c|  2 -
 include/net/switchdev.h|  9 ---
 net/dsa/slave.c|  2 -
 net/switchdev/switchdev.c  | 67 --
 7 files changed, 25 insertions(+), 61 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
index 3756aaecd39c..73e5db176d7e 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
@@ -1968,8 +1968,6 @@ static struct mlxsw_sp_port *mlxsw_sp_lag_rep_port(struct mlxsw_sp *mlxsw_sp,
 static const struct switchdev_ops mlxsw_sp_port_switchdev_ops = {
.switchdev_port_attr_get= mlxsw_sp_port_attr_get,
.switchdev_port_attr_set= mlxsw_sp_port_attr_set,
-   .switchdev_port_obj_add = mlxsw_sp_port_obj_add,
-   .switchdev_port_obj_del = mlxsw_sp_port_obj_del,
 };
 
 static int
diff --git a/drivers/net/ethernet/mscc/ocelot.c b/drivers/net/ethernet/mscc/ocelot.c
index 01403b530522..7f8da8873a96 100644
--- a/drivers/net/ethernet/mscc/ocelot.c
+++ b/drivers/net/ethernet/mscc/ocelot.c
@@ -1337,8 +1337,6 @@ static int ocelot_port_obj_del(struct net_device *dev,
 static const struct switchdev_ops ocelot_port_switchdev_ops = {
.switchdev_port_attr_get= ocelot_port_attr_get,
.switchdev_port_attr_set= ocelot_port_attr_set,
-   .switchdev_port_obj_add = ocelot_port_obj_add,
-   .switchdev_port_obj_del = ocelot_port_obj_del,
 };
 
 static int ocelot_port_bridge_join(struct ocelot_port *ocelot_port,
diff --git a/drivers/net/ethernet/rocker/rocker_main.c b/drivers/net/ethernet/rocker/rocker_main.c
index 806ffe1d906e..f05d5c1341b6 100644
--- a/drivers/net/ethernet/rocker/rocker_main.c
+++ b/drivers/net/ethernet/rocker/rocker_main.c
@@ -2145,8 +2145,6 @@ static int rocker_port_obj_del(struct net_device *dev,
 static const struct switchdev_ops rocker_port_switchdev_ops = {
.switchdev_port_attr_get= rocker_port_attr_get,
.switchdev_port_attr_set= rocker_port_attr_set,
-   .switchdev_port_obj_add = rocker_port_obj_add,
-   .switchdev_port_obj_del = rocker_port_obj_del,
 };
 
 struct rocker_fib_event_work {
diff --git a/drivers/staging/fsl-dpaa2/ethsw/ethsw.c b/drivers/staging/fsl-dpaa2/ethsw/ethsw.c
index 83e1d92dc7f3..06a233c7cdd3 100644
--- a/drivers/staging/fsl-dpaa2/ethsw/ethsw.c
+++ b/drivers/staging/fsl-dpaa2/ethsw/ethsw.c
@@ -930,8 +930,6 @@ static int swdev_port_obj_del(struct net_device *netdev,
 static const struct switchdev_ops ethsw_port_switchdev_ops = {
.switchdev_port_attr_get= swdev_port_attr_get,
.switchdev_port_attr_set= swdev_port_attr_set,
-   .switchdev_port_obj_add = swdev_port_obj_add,
-   .switchdev_port_obj_del = swdev_port_obj_del,
 };
 
 /* For the moment, only flood setting needs to be updated */
diff --git a/include/net/switchdev.h b/include/net/switchdev.h
index 6dc7de576167..866b6d148b77 100644
--- a/include/net/switchdev.h
+++ b/include/net/switchdev.h
@@ -121,10 +121,6 @@ typedef int switchdev_obj_dump_cb_t(struct switchdev_obj *obj);
  * @switchdev_port_attr_get: Get a port attribute (see switchdev_attr).
  *
  * @switchdev_port_attr_set: Set a port attribute (see switchdev_attr).
- *
- * @switchdev_port_obj_add: Add an object to port (see switchdev_obj_*).
- *
- * @switchdev_port_obj_del: Delete an object from port (see switchdev_obj_*).
  */
 struct switchdev_ops {
int (*switchdev_port_attr_get)(struct net_device *dev,
@@ -132,11 +128,6 @@ struct switchdev_ops {
int (*switchdev_port_attr_set)(struct net_device *dev,
   const struct switchdev_attr *attr,
   struct switchdev_trans *trans);
-   int (*switchdev_port_obj_add)(struct net_device *dev,
- const struct switchdev_obj *obj,
- struct switchdev_trans *trans);
-   int (*switchdev_port_obj_del)(struct net_device *dev,
- 

[PATCH net-next 10/12] ocelot: Handle SWITCHDEV_PORT_OBJ_ADD/_DEL

2018-11-22 Thread Petr Machata
Following patches will change the way of distributing port object
changes from a switchdev operation to a switchdev notifier. The
switchdev code currently recursively descends through layers of lower
devices, eventually calling the op on a front-panel port device. The
notifier will instead be sent referencing the bridge port device, which
may be a stacking device that is an upper of one of the front-panel
ports, or a completely unrelated device.

Dispatch the new events to ocelot_port_obj_add() resp. _del() to
maintain the same behavior that the switchdev operation based code
currently has. Pass through switchdev_handle_port_obj_add() / _del() to
handle the recursive descend, because Ocelot supports LAG uppers.

Register to the new switchdev blocking notifier chain to get the new
events when they start getting distributed.

Signed-off-by: Petr Machata 
Acked-by: Jiri Pirko 
---
 drivers/net/ethernet/mscc/ocelot.c   | 28 
 drivers/net/ethernet/mscc/ocelot.h   |  1 +
 drivers/net/ethernet/mscc/ocelot_board.c |  3 +++
 3 files changed, 32 insertions(+)

diff --git a/drivers/net/ethernet/mscc/ocelot.c b/drivers/net/ethernet/mscc/ocelot.c
index 3238b9ee42f3..01403b530522 100644
--- a/drivers/net/ethernet/mscc/ocelot.c
+++ b/drivers/net/ethernet/mscc/ocelot.c
@@ -1595,6 +1595,34 @@ struct notifier_block ocelot_netdevice_nb __read_mostly = {
 };
 EXPORT_SYMBOL(ocelot_netdevice_nb);
 
+static int ocelot_switchdev_blocking_event(struct notifier_block *unused,
+  unsigned long event, void *ptr)
+{
+   struct net_device *dev = switchdev_notifier_info_to_dev(ptr);
+   int err;
+
+   switch (event) {
+   /* Blocking events. */
+   case SWITCHDEV_PORT_OBJ_ADD:
+   err = switchdev_handle_port_obj_add(dev, ptr,
+   ocelot_netdevice_dev_check,
+   ocelot_port_obj_add);
+   return notifier_from_errno(err);
+   case SWITCHDEV_PORT_OBJ_DEL:
+   err = switchdev_handle_port_obj_del(dev, ptr,
+   ocelot_netdevice_dev_check,
+   ocelot_port_obj_del);
+   return notifier_from_errno(err);
+   }
+
+   return NOTIFY_DONE;
+}
+
+struct notifier_block ocelot_switchdev_blocking_nb __read_mostly = {
+   .notifier_call = ocelot_switchdev_blocking_event,
+};
+EXPORT_SYMBOL(ocelot_switchdev_blocking_nb);
+
 int ocelot_probe_port(struct ocelot *ocelot, u8 port,
  void __iomem *regs,
  struct phy_device *phy)
diff --git a/drivers/net/ethernet/mscc/ocelot.h b/drivers/net/ethernet/mscc/ocelot.h
index 62c7c8eb00d9..086775f7b52f 100644
--- a/drivers/net/ethernet/mscc/ocelot.h
+++ b/drivers/net/ethernet/mscc/ocelot.h
@@ -499,5 +499,6 @@ int ocelot_probe_port(struct ocelot *ocelot, u8 port,
  struct phy_device *phy);
 
 extern struct notifier_block ocelot_netdevice_nb;
+extern struct notifier_block ocelot_switchdev_blocking_nb;
 
 #endif
diff --git a/drivers/net/ethernet/mscc/ocelot_board.c b/drivers/net/ethernet/mscc/ocelot_board.c
index 4c23d18bbf44..ca3ea2fbfcd0 100644
--- a/drivers/net/ethernet/mscc/ocelot_board.c
+++ b/drivers/net/ethernet/mscc/ocelot_board.c
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include <net/switchdev.h>
 
 #include "ocelot.h"
 
@@ -328,6 +329,7 @@ static int mscc_ocelot_probe(struct platform_device *pdev)
}
 
	register_netdevice_notifier(&ocelot_netdevice_nb);
+	register_switchdev_blocking_notifier(&ocelot_switchdev_blocking_nb);
 
dev_info(>dev, "Ocelot switch probed\n");
 
@@ -342,6 +344,7 @@ static int mscc_ocelot_remove(struct platform_device *pdev)
struct ocelot *ocelot = platform_get_drvdata(pdev);
 
ocelot_deinit(ocelot);
+	unregister_switchdev_blocking_notifier(&ocelot_switchdev_blocking_nb);
	unregister_netdevice_notifier(&ocelot_netdevice_nb);
 
return 0;
-- 
2.4.11



[PATCH net-next 09/12] mlxsw: spectrum_switchdev: Handle SWITCHDEV_PORT_OBJ_ADD/_DEL

2018-11-22 Thread Petr Machata
Following patches will change the way of distributing port object
changes from a switchdev operation to a switchdev notifier. The
switchdev code currently recursively descends through layers of lower
devices, eventually calling the op on a front-panel port device. The
notifier will instead be sent referencing the bridge port device, which
may be a stacking device that is an upper of one of the front-panel
ports, or a completely unrelated device.

To handle SWITCHDEV_PORT_OBJ_ADD and _DEL, subscribe to the blocking
notifier chain. Dispatch to mlxsw_sp_port_obj_add() resp. _del() to
maintain the behavior that the switchdev operation based code currently
has. Defer to switchdev_handle_port_obj_add() / _del() to handle the
recursive descend, because mlxsw supports a number of upper types.

Signed-off-by: Petr Machata 
Reviewed-by: Ido Schimmel 
---
 .../ethernet/mellanox/mlxsw/spectrum_switchdev.c   | 45 +-
 1 file changed, 44 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
index b32a5ee57fb9..3756aaecd39c 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
@@ -3118,6 +3118,32 @@ static struct notifier_block mlxsw_sp_switchdev_notifier = {
.notifier_call = mlxsw_sp_switchdev_event,
 };
 
+static int mlxsw_sp_switchdev_blocking_event(struct notifier_block *unused,
+unsigned long event, void *ptr)
+{
+   struct net_device *dev = switchdev_notifier_info_to_dev(ptr);
+   int err;
+
+   switch (event) {
+   case SWITCHDEV_PORT_OBJ_ADD:
+   err = switchdev_handle_port_obj_add(dev, ptr,
+   mlxsw_sp_port_dev_check,
+   mlxsw_sp_port_obj_add);
+   return notifier_from_errno(err);
+   case SWITCHDEV_PORT_OBJ_DEL:
+   err = switchdev_handle_port_obj_del(dev, ptr,
+   mlxsw_sp_port_dev_check,
+   mlxsw_sp_port_obj_del);
+   return notifier_from_errno(err);
+   }
+
+   return NOTIFY_DONE;
+}
+
+static struct notifier_block mlxsw_sp_switchdev_blocking_notifier = {
+   .notifier_call = mlxsw_sp_switchdev_blocking_event,
+};
+
 u8
 mlxsw_sp_bridge_port_stp_state(struct mlxsw_sp_bridge_port *bridge_port)
 {
@@ -3127,6 +3153,7 @@ mlxsw_sp_bridge_port_stp_state(struct mlxsw_sp_bridge_port *bridge_port)
 static int mlxsw_sp_fdb_init(struct mlxsw_sp *mlxsw_sp)
 {
struct mlxsw_sp_bridge *bridge = mlxsw_sp->bridge;
+   struct notifier_block *nb;
int err;
 
err = mlxsw_sp_ageing_set(mlxsw_sp, MLXSW_SP_DEFAULT_AGEING_TIME);
@@ -3141,17 +3168,33 @@ static int mlxsw_sp_fdb_init(struct mlxsw_sp *mlxsw_sp)
return err;
}
 
+	nb = &mlxsw_sp_switchdev_blocking_notifier;
+   err = register_switchdev_blocking_notifier(nb);
+   if (err) {
+		dev_err(mlxsw_sp->bus_info->dev, "Failed to register switchdev blocking notifier\n");
+   goto err_register_switchdev_blocking_notifier;
+   }
+
INIT_DELAYED_WORK(>fdb_notify.dw, mlxsw_sp_fdb_notify_work);
bridge->fdb_notify.interval = MLXSW_SP_DEFAULT_LEARNING_INTERVAL;
mlxsw_sp_fdb_notify_work_schedule(mlxsw_sp);
return 0;
+
+err_register_switchdev_blocking_notifier:
+	unregister_switchdev_notifier(&mlxsw_sp_switchdev_notifier);
+   return err;
 }
 
 static void mlxsw_sp_fdb_fini(struct mlxsw_sp *mlxsw_sp)
 {
+   struct notifier_block *nb;
+
	cancel_delayed_work_sync(&mlxsw_sp->bridge->fdb_notify.dw);
-	unregister_switchdev_notifier(&mlxsw_sp_switchdev_notifier);
 
+	nb = &mlxsw_sp_switchdev_blocking_notifier;
+	unregister_switchdev_blocking_notifier(nb);
+
+	unregister_switchdev_notifier(&mlxsw_sp_switchdev_notifier);
 }
 
 int mlxsw_sp_switchdev_init(struct mlxsw_sp *mlxsw_sp)
-- 
2.4.11



[PATCH net-next 08/12] switchdev: Add helpers to aid traversal through lower devices

2018-11-22 Thread Petr Machata
After the transition from switchdev operations to notifier chain (which
will take place in following patches), the onus is on the driver to find
its own devices below possible layer of LAG or other uppers.

The logic to do so is fairly repetitive: each driver is looking for its
own devices among the lowers of the notified device. For those that it
finds, it calls a handler. To indicate that the event was handled,
struct switchdev_notifier_port_obj_info.handled is set. The differences
lie only in what constitutes an "own" device and what handler to call.

Therefore abstract this logic into two helpers,
switchdev_handle_port_obj_add() and switchdev_handle_port_obj_del(). If
a driver only supports physical ports under a bridge device, it will
simply avoid this layer of indirection.

One area where this helper diverges from the current switchdev behavior
is the case of mixed lowers, some of which are switchdev ports and some
of which are not. Previously, such scenario would fail with -EOPNOTSUPP.
The helper could do that for lowers for which the passed-in predicate
doesn't hold. That would however break the case that switchdev ports
from several different drivers are stashed under one master, a scenario
that switchdev currently happily supports. Therefore tolerate any and
all unknown netdevices, whether they are backed by a switchdev driver
or not.

Signed-off-by: Petr Machata 
Reviewed-by: Ido Schimmel 
---
 include/net/switchdev.h   |  33 +++
 net/switchdev/switchdev.c | 100 ++
 2 files changed, 133 insertions(+)

diff --git a/include/net/switchdev.h b/include/net/switchdev.h
index a2f3ebf39301..6dc7de576167 100644
--- a/include/net/switchdev.h
+++ b/include/net/switchdev.h
@@ -210,6 +210,18 @@ void switchdev_port_fwd_mark_set(struct net_device *dev,
 bool switchdev_port_same_parent_id(struct net_device *a,
   struct net_device *b);
 
+int switchdev_handle_port_obj_add(struct net_device *dev,
+   struct switchdev_notifier_port_obj_info *port_obj_info,
+   bool (*check_cb)(const struct net_device *dev),
+   int (*add_cb)(struct net_device *dev,
+ const struct switchdev_obj *obj,
+ struct switchdev_trans *trans));
+int switchdev_handle_port_obj_del(struct net_device *dev,
+   struct switchdev_notifier_port_obj_info *port_obj_info,
+   bool (*check_cb)(const struct net_device *dev),
+   int (*del_cb)(struct net_device *dev,
+ const struct switchdev_obj *obj));
+
 #define SWITCHDEV_SET_OPS(netdev, ops) ((netdev)->switchdev_ops = (ops))
 #else
 
@@ -284,6 +296,27 @@ static inline bool switchdev_port_same_parent_id(struct net_device *a,
return false;
 }
 
+static inline int
+switchdev_handle_port_obj_add(struct net_device *dev,
+   struct switchdev_notifier_port_obj_info *port_obj_info,
+   bool (*check_cb)(const struct net_device *dev),
+   int (*add_cb)(struct net_device *dev,
+ const struct switchdev_obj *obj,
+ struct switchdev_trans *trans))
+{
+   return 0;
+}
+
+static inline int
+switchdev_handle_port_obj_del(struct net_device *dev,
+   struct switchdev_notifier_port_obj_info *port_obj_info,
+   bool (*check_cb)(const struct net_device *dev),
+   int (*del_cb)(struct net_device *dev,
+ const struct switchdev_obj *obj))
+{
+   return 0;
+}
+
 #define SWITCHDEV_SET_OPS(netdev, ops) do {} while (0)
 
 #endif
diff --git a/net/switchdev/switchdev.c b/net/switchdev/switchdev.c
index e109bb97ce3f..099434ec7996 100644
--- a/net/switchdev/switchdev.c
+++ b/net/switchdev/switchdev.c
@@ -621,3 +621,103 @@ bool switchdev_port_same_parent_id(struct net_device *a,
	return netdev_phys_item_id_same(&a_attr.u.ppid, &b_attr.u.ppid);
 }
 EXPORT_SYMBOL_GPL(switchdev_port_same_parent_id);
+
+static int __switchdev_handle_port_obj_add(struct net_device *dev,
+   struct switchdev_notifier_port_obj_info *port_obj_info,
+   bool (*check_cb)(const struct net_device *dev),
+   int (*add_cb)(struct net_device *dev,
+ const struct switchdev_obj *obj,
+ struct switchdev_trans *trans))
+{
+   struct net_device *lower_dev;
+   struct list_head *iter;
+   int err = -EOPNOTSUPP;
+
+   if (check_cb(dev)) {
+   /* This flag is only checked if the return value is success. */
+   port_obj_info->handled = true;
+   return add_cb(dev, port_obj_info->obj, port_obj_info->trans);
+   }
+
+   /* Switch 

[PATCH net-next 07/12] staging: fsl-dpaa2: ethsw: Handle SWITCHDEV_PORT_OBJ_ADD/_DEL

2018-11-22 Thread Petr Machata
Following patches will change the way of distributing port object
changes from a switchdev operation to a switchdev notifier. The
switchdev code currently recursively descends through layers of lower
devices, eventually calling the op on a front-panel port device. The
notifier will instead be sent referencing the bridge port device, which
may be a stacking device that is an upper of one of the front-panel
ports, or a completely unrelated device.

ethsw currently doesn't support any uppers other than bridge.
SWITCHDEV_OBJ_ID_HOST_MDB and _PORT_MDB objects are always notified on
the bridge port device. Thus the only case in which a stacked device could
be validly referenced by port object notifications is bridge
notifications for VLAN objects added to the bridge itself. But the
driver explicitly rejects such notifications in port_vlans_add(). It is
therefore safe to assume that the only interesting case is that the
notification is on a front-panel port netdevice.

To handle SWITCHDEV_PORT_OBJ_ADD and _DEL, subscribe to the blocking
notifier chain. Dispatch to swdev_port_obj_add() resp. _del() to
maintain the behavior that the switchdev operation based code currently
has.

Signed-off-by: Petr Machata 
Acked-by: Jiri Pirko 
---
 drivers/staging/fsl-dpaa2/ethsw/ethsw.c | 56 +
 1 file changed, 56 insertions(+)

diff --git a/drivers/staging/fsl-dpaa2/ethsw/ethsw.c b/drivers/staging/fsl-dpaa2/ethsw/ethsw.c
index e379b0fa936f..83e1d92dc7f3 100644
--- a/drivers/staging/fsl-dpaa2/ethsw/ethsw.c
+++ b/drivers/staging/fsl-dpaa2/ethsw/ethsw.c
@@ -1088,10 +1088,51 @@ static int port_switchdev_event(struct notifier_block *unused,
return NOTIFY_BAD;
 }
 
+static int
+ethsw_switchdev_port_obj_event(unsigned long event, struct net_device *netdev,
+   struct switchdev_notifier_port_obj_info *port_obj_info)
+{
+   int err = -EOPNOTSUPP;
+
+   switch (event) {
+   case SWITCHDEV_PORT_OBJ_ADD:
+   err = swdev_port_obj_add(netdev, port_obj_info->obj,
+port_obj_info->trans);
+   break;
+   case SWITCHDEV_PORT_OBJ_DEL:
+   err = swdev_port_obj_del(netdev, port_obj_info->obj);
+   break;
+   }
+
+   port_obj_info->handled = true;
+   return notifier_from_errno(err);
+}
+
+static int port_switchdev_blocking_event(struct notifier_block *unused,
+unsigned long event, void *ptr)
+{
+   struct net_device *dev = switchdev_notifier_info_to_dev(ptr);
+
+   if (!ethsw_port_dev_check(dev))
+   return NOTIFY_DONE;
+
+   switch (event) {
+   case SWITCHDEV_PORT_OBJ_ADD: /* fall through */
+   case SWITCHDEV_PORT_OBJ_DEL:
+   return ethsw_switchdev_port_obj_event(event, dev, ptr);
+   }
+
+   return NOTIFY_DONE;
+}
+
 static struct notifier_block port_switchdev_nb = {
.notifier_call = port_switchdev_event,
 };
 
+static struct notifier_block port_switchdev_blocking_nb = {
+   .notifier_call = port_switchdev_blocking_event,
+};
+
 static int ethsw_register_notifier(struct device *dev)
 {
int err;
@@ -1108,8 +1149,16 @@ static int ethsw_register_notifier(struct device *dev)
goto err_switchdev_nb;
}
 
+	err = register_switchdev_blocking_notifier(&port_switchdev_blocking_nb);
+   if (err) {
+		dev_err(dev, "Failed to register switchdev blocking notifier\n");
+   goto err_switchdev_blocking_nb;
+   }
+
return 0;
 
+err_switchdev_blocking_nb:
+	unregister_switchdev_notifier(&port_switchdev_nb);
 err_switchdev_nb:
	unregister_netdevice_notifier(&port_nb);
return err;
@@ -1296,8 +1345,15 @@ static int ethsw_port_init(struct ethsw_port_priv *port_priv, u16 port)
 
 static void ethsw_unregister_notifier(struct device *dev)
 {
+   struct notifier_block *nb;
int err;
 
+	nb = &port_switchdev_blocking_nb;
+   err = unregister_switchdev_blocking_notifier(nb);
+   if (err)
+   dev_err(dev,
+   "Failed to unregister switchdev blocking notifier (%d)\n", err);
+
err = unregister_switchdev_notifier(&port_switchdev_nb);
if (err)
dev_err(dev,
-- 
2.4.11



[PATCH net-next 04/12] rocker: Handle SWITCHDEV_PORT_OBJ_ADD/_DEL

2018-11-22 Thread Petr Machata
Following patches will change the way of distributing port object
changes from a switchdev operation to a switchdev notifier. The
switchdev code currently recursively descends through layers of lower
devices, eventually calling the op on a front-panel port device. The
notifier will instead be sent referencing the bridge port device, which
may be a stacking device that's one of front-panel ports uppers, or a
completely unrelated device.

rocker currently doesn't support any uppers other than bridge. Thus the
only case that a stacked device could be validly referenced by port
object notifications are bridge notifications for VLAN objects added to
the bridge itself. But the driver explicitly rejects such notifications
in rocker_world_port_obj_vlan_add(). It is therefore safe to assume that
the only interesting case is that the notification is on a front-panel
port netdevice.

Subscribe to the blocking notifier chain. In the handler, filter out
notifications on any foreign netdevices. Dispatch the new notifiers to
rocker_port_obj_add() resp. _del() to maintain the behavior that the
switchdev operation based code currently has.

Signed-off-by: Petr Machata 
Acked-by: Jiri Pirko 
---
 drivers/net/ethernet/rocker/rocker_main.c | 55 +++
 1 file changed, 55 insertions(+)

diff --git a/drivers/net/ethernet/rocker/rocker_main.c b/drivers/net/ethernet/rocker/rocker_main.c
index beb06628f22d..806ffe1d906e 100644
--- a/drivers/net/ethernet/rocker/rocker_main.c
+++ b/drivers/net/ethernet/rocker/rocker_main.c
@@ -2812,12 +2812,54 @@ static int rocker_switchdev_event(struct notifier_block *unused,
return NOTIFY_DONE;
 }
 
+static int
+rocker_switchdev_port_obj_event(unsigned long event, struct net_device *netdev,
+   struct switchdev_notifier_port_obj_info *port_obj_info)
+{
+   int err = -EOPNOTSUPP;
+
+   switch (event) {
+   case SWITCHDEV_PORT_OBJ_ADD:
+   err = rocker_port_obj_add(netdev, port_obj_info->obj,
+ port_obj_info->trans);
+   break;
+   case SWITCHDEV_PORT_OBJ_DEL:
+   err = rocker_port_obj_del(netdev, port_obj_info->obj);
+   break;
+   }
+
+   port_obj_info->handled = true;
+   return notifier_from_errno(err);
+}
+
+static int rocker_switchdev_blocking_event(struct notifier_block *unused,
+  unsigned long event, void *ptr)
+{
+   struct net_device *dev = switchdev_notifier_info_to_dev(ptr);
+
+   if (!rocker_port_dev_check(dev))
+   return NOTIFY_DONE;
+
+   switch (event) {
+   case SWITCHDEV_PORT_OBJ_ADD:
+   case SWITCHDEV_PORT_OBJ_DEL:
+   return rocker_switchdev_port_obj_event(event, dev, ptr);
+   }
+
+   return NOTIFY_DONE;
+}
+
 static struct notifier_block rocker_switchdev_notifier = {
.notifier_call = rocker_switchdev_event,
 };
 
+static struct notifier_block rocker_switchdev_blocking_notifier = {
+   .notifier_call = rocker_switchdev_blocking_event,
+};
+
 static int rocker_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 {
+   struct notifier_block *nb;
struct rocker *rocker;
int err;
 
@@ -2933,6 +2975,13 @@ static int rocker_probe(struct pci_dev *pdev, const struct pci_device_id *id)
goto err_register_switchdev_notifier;
}
 
+   nb = &rocker_switchdev_blocking_notifier;
+   err = register_switchdev_blocking_notifier(nb);
+   if (err) {
+   dev_err(&pdev->dev, "Failed to register switchdev blocking notifier\n");
+   goto err_register_switchdev_blocking_notifier;
+   }
+
rocker->hw.id = rocker_read64(rocker, SWITCH_ID);
 
dev_info(&pdev->dev, "Rocker switch with id %*phN\n",
@@ -2940,6 +2989,8 @@ static int rocker_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 
return 0;
 
+err_register_switchdev_blocking_notifier:
+   unregister_switchdev_notifier(_switchdev_notifier);
 err_register_switchdev_notifier:
unregister_fib_notifier(>fib_nb);
 err_register_fib_notifier:
@@ -2971,6 +3022,10 @@ static int rocker_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 static void rocker_remove(struct pci_dev *pdev)
 {
struct rocker *rocker = pci_get_drvdata(pdev);
+   struct notifier_block *nb;
+
+   nb = &rocker_switchdev_blocking_notifier;
+   unregister_switchdev_blocking_notifier(nb);
 
unregister_switchdev_notifier(&rocker_switchdev_notifier);
unregister_fib_notifier(&rocker->fib_nb);
-- 
2.4.11



[PATCH net-next 06/12] staging: fsl-dpaa2: ethsw: Introduce ethsw_port_dev_check()

2018-11-22 Thread Petr Machata
ethsw currently uses an open-coded comparison of netdev_ops to determine
whether a device represents a front panel port. Wrap this into a named
function to simplify reuse.

Signed-off-by: Petr Machata 
Acked-by: Jiri Pirko 
---
 drivers/staging/fsl-dpaa2/ethsw/ethsw.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/staging/fsl-dpaa2/ethsw/ethsw.c b/drivers/staging/fsl-dpaa2/ethsw/ethsw.c
index 7a7ca67822c5..e379b0fa936f 100644
--- a/drivers/staging/fsl-dpaa2/ethsw/ethsw.c
+++ b/drivers/staging/fsl-dpaa2/ethsw/ethsw.c
@@ -972,6 +972,11 @@ static int port_bridge_leave(struct net_device *netdev)
return err;
 }
 
+static bool ethsw_port_dev_check(const struct net_device *netdev)
+{
+   return netdev->netdev_ops == &ethsw_port_ops;
+}
+
 static int port_netdevice_event(struct notifier_block *unused,
unsigned long event, void *ptr)
 {
@@ -980,7 +985,7 @@ static int port_netdevice_event(struct notifier_block *unused,
struct net_device *upper_dev;
int err = 0;
 
-   if (netdev->netdev_ops != &ethsw_port_ops)
+   if (!ethsw_port_dev_check(netdev))
return NOTIFY_DONE;
 
/* Handle just upper dev link/unlink for the moment */
-- 
2.4.11



[PATCH net-next 05/12] net: dsa: slave: Handle SWITCHDEV_PORT_OBJ_ADD/_DEL

2018-11-22 Thread Petr Machata
Following patches will change the way of distributing port object
changes from a switchdev operation to a switchdev notifier. The
switchdev code currently recursively descends through layers of lower
devices, eventually calling the op on a front-panel port device. The
notifier will instead be sent referencing the bridge port device, which
may be a stacking device that's one of front-panel ports uppers, or a
completely unrelated device.

DSA currently doesn't support any other uppers than bridge.
SWITCHDEV_OBJ_ID_HOST_MDB and _PORT_MDB objects are always notified on
the bridge port device. Thus the only case that a stacked device could
be validly referenced by port object notifications are bridge
notifications for VLAN objects added to the bridge itself. But the
driver explicitly rejects such notifications in dsa_port_vlan_add(). It
is therefore safe to assume that the only interesting case is that the
notification is on a front-panel port netdevice. Therefore keep the
filtering by dsa_slave_dev_check() in place.

To handle SWITCHDEV_PORT_OBJ_ADD and _DEL, subscribe to the blocking
notifier chain. Dispatch to dsa_slave_port_obj_add() resp. _del() to
maintain the behavior that the switchdev operation based code currently
has.

Signed-off-by: Petr Machata 
Acked-by: Jiri Pirko 
---
 net/dsa/slave.c | 56 
 1 file changed, 56 insertions(+)

diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index 7d0c19e7edcf..d00a0b6d4ce0 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -1557,6 +1557,44 @@ static int dsa_slave_switchdev_event(struct notifier_block *unused,
return NOTIFY_BAD;
 }
 
+static int
+dsa_slave_switchdev_port_obj_event(unsigned long event,
+   struct net_device *netdev,
+   struct switchdev_notifier_port_obj_info *port_obj_info)
+{
+   int err = -EOPNOTSUPP;
+
+   switch (event) {
+   case SWITCHDEV_PORT_OBJ_ADD:
+   err = dsa_slave_port_obj_add(netdev, port_obj_info->obj,
+port_obj_info->trans);
+   break;
+   case SWITCHDEV_PORT_OBJ_DEL:
+   err = dsa_slave_port_obj_del(netdev, port_obj_info->obj);
+   break;
+   }
+
+   port_obj_info->handled = true;
+   return notifier_from_errno(err);
+}
+
+static int dsa_slave_switchdev_blocking_event(struct notifier_block *unused,
+ unsigned long event, void *ptr)
+{
+   struct net_device *dev = switchdev_notifier_info_to_dev(ptr);
+
+   if (!dsa_slave_dev_check(dev))
+   return NOTIFY_DONE;
+
+   switch (event) {
+   case SWITCHDEV_PORT_OBJ_ADD: /* fall through */
+   case SWITCHDEV_PORT_OBJ_DEL:
+   return dsa_slave_switchdev_port_obj_event(event, dev, ptr);
+   }
+
+   return NOTIFY_DONE;
+}
+
 static struct notifier_block dsa_slave_nb __read_mostly = {
.notifier_call  = dsa_slave_netdevice_event,
 };
@@ -1565,8 +1603,13 @@ static struct notifier_block dsa_slave_switchdev_notifier = {
.notifier_call = dsa_slave_switchdev_event,
 };
 
+static struct notifier_block dsa_slave_switchdev_blocking_notifier = {
+   .notifier_call = dsa_slave_switchdev_blocking_event,
+};
+
 int dsa_slave_register_notifier(void)
 {
+   struct notifier_block *nb;
int err;
 
err = register_netdevice_notifier(&dsa_slave_nb);
@@ -1577,8 +1620,15 @@ int dsa_slave_register_notifier(void)
if (err)
goto err_switchdev_nb;
 
+   nb = &dsa_slave_switchdev_blocking_notifier;
+   err = register_switchdev_blocking_notifier(nb);
+   if (err)
+   goto err_switchdev_blocking_nb;
+
return 0;
 
+err_switchdev_blocking_nb:
+   unregister_switchdev_notifier(&dsa_slave_switchdev_notifier);
 err_switchdev_nb:
unregister_netdevice_notifier(&dsa_slave_nb);
return err;
@@ -1586,8 +1636,14 @@ int dsa_slave_register_notifier(void)
 
 void dsa_slave_unregister_notifier(void)
 {
+   struct notifier_block *nb;
int err;
 
+   nb = &dsa_slave_switchdev_blocking_notifier;
+   err = unregister_switchdev_blocking_notifier(nb);
+   if (err)
+   pr_err("DSA: failed to unregister switchdev blocking notifier (%d)\n", err);
+
err = unregister_switchdev_notifier(&dsa_slave_switchdev_notifier);
if (err)
pr_err("DSA: failed to unregister switchdev notifier (%d)\n", err);
-- 
2.4.11



[PATCH net-next 02/12] switchdev: Add a blocking notifier chain

2018-11-22 Thread Petr Machata
In general one can't assume that a switchdev notifier is called in a
non-atomic context, and correspondingly, the switchdev notifier chain is
an atomic one.

However, port object addition and deletion messages are delivered from a
process context. Even the MDB addition messages, whose delivery is
scheduled from atomic context, are queued and the delivery itself takes
place in blocking context. For VLAN messages in particular, keeping the
blocking nature is important for error reporting.

Therefore introduce a blocking notifier chain and related service
functions to distribute the notifications for which a blocking context
can be assumed.

Signed-off-by: Petr Machata 
Reviewed-by: Jiri Pirko 
Reviewed-by: Ido Schimmel 
---
 include/net/switchdev.h   | 27 +++
 net/switchdev/switchdev.c | 26 ++
 2 files changed, 53 insertions(+)

diff --git a/include/net/switchdev.h b/include/net/switchdev.h
index dd969224a9b9..e021b67b9b32 100644
--- a/include/net/switchdev.h
+++ b/include/net/switchdev.h
@@ -182,10 +182,17 @@ int switchdev_port_obj_add(struct net_device *dev,
   const struct switchdev_obj *obj);
 int switchdev_port_obj_del(struct net_device *dev,
   const struct switchdev_obj *obj);
+
 int register_switchdev_notifier(struct notifier_block *nb);
 int unregister_switchdev_notifier(struct notifier_block *nb);
 int call_switchdev_notifiers(unsigned long val, struct net_device *dev,
 struct switchdev_notifier_info *info);
+
+int register_switchdev_blocking_notifier(struct notifier_block *nb);
+int unregister_switchdev_blocking_notifier(struct notifier_block *nb);
+int call_switchdev_blocking_notifiers(unsigned long val, struct net_device *dev,
+ struct switchdev_notifier_info *info);
+
 void switchdev_port_fwd_mark_set(struct net_device *dev,
 struct net_device *group_dev,
 bool joining);
@@ -241,6 +248,26 @@ static inline int call_switchdev_notifiers(unsigned long val,
return NOTIFY_DONE;
 }
 
+static inline int
+register_switchdev_blocking_notifier(struct notifier_block *nb)
+{
+   return 0;
+}
+
+static inline int
+unregister_switchdev_blocking_notifier(struct notifier_block *nb)
+{
+   return 0;
+}
+
+static inline int
+call_switchdev_blocking_notifiers(unsigned long val,
+ struct net_device *dev,
+ struct switchdev_notifier_info *info)
+{
+   return NOTIFY_DONE;
+}
+
 static inline bool switchdev_port_same_parent_id(struct net_device *a,
 struct net_device *b)
 {
diff --git a/net/switchdev/switchdev.c b/net/switchdev/switchdev.c
index 74b9d916a58b..e109bb97ce3f 100644
--- a/net/switchdev/switchdev.c
+++ b/net/switchdev/switchdev.c
@@ -535,6 +535,7 @@ int switchdev_port_obj_del(struct net_device *dev,
 EXPORT_SYMBOL_GPL(switchdev_port_obj_del);
 
 static ATOMIC_NOTIFIER_HEAD(switchdev_notif_chain);
+static BLOCKING_NOTIFIER_HEAD(switchdev_blocking_notif_chain);
 
 /**
  * register_switchdev_notifier - Register notifier
@@ -576,6 +577,31 @@ int call_switchdev_notifiers(unsigned long val, struct net_device *dev,
 }
 EXPORT_SYMBOL_GPL(call_switchdev_notifiers);
 
+int register_switchdev_blocking_notifier(struct notifier_block *nb)
+{
+   struct blocking_notifier_head *chain = &switchdev_blocking_notif_chain;
+
+   return blocking_notifier_chain_register(chain, nb);
+}
+EXPORT_SYMBOL_GPL(register_switchdev_blocking_notifier);
+
+int unregister_switchdev_blocking_notifier(struct notifier_block *nb)
+{
+   struct blocking_notifier_head *chain = &switchdev_blocking_notif_chain;
+
+   return blocking_notifier_chain_unregister(chain, nb);
+}
+EXPORT_SYMBOL_GPL(unregister_switchdev_blocking_notifier);
+
+int call_switchdev_blocking_notifiers(unsigned long val, struct net_device *dev,
+ struct switchdev_notifier_info *info)
+{
+   info->dev = dev;
+   return blocking_notifier_call_chain(&switchdev_blocking_notif_chain,
+   val, info);
+}
+EXPORT_SYMBOL_GPL(call_switchdev_blocking_notifiers);
+
 bool switchdev_port_same_parent_id(struct net_device *a,
   struct net_device *b)
 {
-- 
2.4.11



[PATCH net-next 01/12] switchdev: SWITCHDEV_OBJ_PORT_{VLAN, MDB}(): Sanitize

2018-11-22 Thread Petr Machata
The two macros SWITCHDEV_OBJ_PORT_VLAN() and SWITCHDEV_OBJ_PORT_MDB()
expand to a container_of() call, yielding an appropriate container of
their sole argument. However, due to a name collision, the first
argument, i.e. the contained object pointer, is not the only one to get
expanded. The third argument, which is a structure member name, and
should be kept literal, gets expanded as well. The only safe way to use
these two macros is therefore to name the local variable passed to them
"obj".

To fix this, rename the sole argument of the two macros from
"obj" (which collides with the member name) to "OBJ". Additionally,
instead of passing "OBJ" to container_of() verbatim, parenthesize it, so
that a comma in the passed-in expression doesn't pollute the
container_of() invocation.

Signed-off-by: Petr Machata 
Acked-by: Jiri Pirko 
Reviewed-by: Ido Schimmel 
---
 include/net/switchdev.h | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/net/switchdev.h b/include/net/switchdev.h
index 7b371e7c4bc6..dd969224a9b9 100644
--- a/include/net/switchdev.h
+++ b/include/net/switchdev.h
@@ -95,8 +95,8 @@ struct switchdev_obj_port_vlan {
u16 vid_end;
 };
 
-#define SWITCHDEV_OBJ_PORT_VLAN(obj) \
-   container_of(obj, struct switchdev_obj_port_vlan, obj)
+#define SWITCHDEV_OBJ_PORT_VLAN(OBJ) \
+   container_of((OBJ), struct switchdev_obj_port_vlan, obj)
 
 /* SWITCHDEV_OBJ_ID_PORT_MDB */
 struct switchdev_obj_port_mdb {
@@ -105,8 +105,8 @@ struct switchdev_obj_port_mdb {
u16 vid;
 };
 
-#define SWITCHDEV_OBJ_PORT_MDB(obj) \
-   container_of(obj, struct switchdev_obj_port_mdb, obj)
+#define SWITCHDEV_OBJ_PORT_MDB(OBJ) \
+   container_of((OBJ), struct switchdev_obj_port_mdb, obj)
 
 void switchdev_trans_item_enqueue(struct switchdev_trans *trans,
  void *data, void (*destructor)(void const *),
-- 
2.4.11



[PATCH net-next 03/12] switchdev: Add SWITCHDEV_PORT_OBJ_ADD, SWITCHDEV_PORT_OBJ_DEL

2018-11-22 Thread Petr Machata
An offloading driver may need to have access to switchdev events on
ports that aren't directly under its control. An example is a VXLAN port
attached to a bridge offloaded by a driver. The driver needs to know
about VLANs configured on the VXLAN device. However the VXLAN device
isn't stashed between the bridge and a front-panel-port device (such as
is the case e.g. for LAG devices), so the usual switchdev ops don't
reach the driver.

VXLAN is likely not the only device type like this: in theory any L2
tunnel device that needs offloading will prompt requirement of this
sort. This falsifies the assumption that only the lower devices of a
front panel port need to be notified to achieve flawless offloading.

A way to fix this is to give up the notion of port object addition /
deletion as a switchdev operation, which assumes somewhat tight coupling
between the message producer and consumer. And instead send the message
over a notifier chain.

To that end, introduce two new switchdev notifier types,
SWITCHDEV_PORT_OBJ_ADD and SWITCHDEV_PORT_OBJ_DEL. These notifier types
communicate the same event as the corresponding switchdev op, except in
a form of a notification. struct switchdev_notifier_port_obj_info was
added to carry the fields that the switchdev op carries. An additional
field, handled, will be used to communicate back to switchdev that the
event has reached an interested party, which will be important for the
two-phase commit.

The two switchdev operations themselves are kept in place. Following
patches first convert individual clients to the notifier protocol, and
only then are the operations removed.

Signed-off-by: Petr Machata 
Acked-by: Jiri Pirko 
Reviewed-by: Ido Schimmel 
---
 include/net/switchdev.h | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/include/net/switchdev.h b/include/net/switchdev.h
index e021b67b9b32..a2f3ebf39301 100644
--- a/include/net/switchdev.h
+++ b/include/net/switchdev.h
@@ -146,6 +146,9 @@ enum switchdev_notifier_type {
SWITCHDEV_FDB_DEL_TO_DEVICE,
SWITCHDEV_FDB_OFFLOADED,
 
+   SWITCHDEV_PORT_OBJ_ADD, /* Blocking. */
+   SWITCHDEV_PORT_OBJ_DEL, /* Blocking. */
+
SWITCHDEV_VXLAN_FDB_ADD_TO_BRIDGE,
SWITCHDEV_VXLAN_FDB_DEL_TO_BRIDGE,
SWITCHDEV_VXLAN_FDB_ADD_TO_DEVICE,
@@ -165,6 +168,13 @@ struct switchdev_notifier_fdb_info {
   offloaded:1;
 };
 
+struct switchdev_notifier_port_obj_info {
+   struct switchdev_notifier_info info; /* must be first */
+   const struct switchdev_obj *obj;
+   struct switchdev_trans *trans;
+   bool handled;
+};
+
 static inline struct net_device *
 switchdev_notifier_info_to_dev(const struct switchdev_notifier_info *info)
 {
-- 
2.4.11



[PATCH net-next 00/12] switchdev: Convert switchdev_port_obj_{add,del}() to notifiers

2018-11-22 Thread Petr Machata
An offloading driver may need to have access to switchdev events on
ports that aren't directly under its control. An example is a VXLAN port
attached to a bridge offloaded by a driver. The driver needs to know
about VLANs configured on the VXLAN device. However the VXLAN device
isn't stashed between the bridge and a front-panel-port device (such as
is the case e.g. for LAG devices), so the usual switchdev ops don't
reach the driver.

VXLAN is likely not the only device type like this: in theory any L2
tunnel device that needs offloading will prompt requirement of this
sort.

A way to fix this is to give up the notion of port object addition /
deletion as a switchdev operation, which assumes somewhat tight coupling
between the message producer and consumer. And instead send the message
over a notifier chain.

The series starts with a clean-up patch #1, where
SWITCHDEV_OBJ_PORT_{VLAN, MDB}() are fixed up to lift the constraint
that the passed-in argument be a simple variable named "obj".

switchdev_port_obj_add and _del are invoked in a context that permits
blocking. Not only that, at least for the VLAN notification, being able
to signal failure is actually important. Therefore introduce a new
blocking notifier chain that the new events will be sent on. That's done
in patch #2. Retain the current (atomic) notifier chain for the
preexisting notifications.

In patch #3, introduce two new switchdev notifier types,
SWITCHDEV_PORT_OBJ_ADD and SWITCHDEV_PORT_OBJ_DEL. These notifier types
communicate the same event as the corresponding switchdev op, except in
a form of a notification. struct switchdev_notifier_port_obj_info was
added to carry the fields that correspond to the switchdev op arguments.
An additional field, handled, will be used to communicate back to
switchdev that the event has reached an interested party, which will be
important for the two-phase commit.

In patches #4, #5, and #7, rocker, DSA resp. ethsw are updated to
subscribe to the switchdev blocking notifier chain, and handle the new
notifier types. #6 introduces a helper to determine whether a
netdevice corresponds to a front panel port.

What these three drivers have in common is that their ports don't
support any uppers besides bridge. That makes it possible to ignore any
notifiers that don't reference a front-panel port device, because they
are certainly out of scope.

Unlike the previous three, mlxsw and ocelot drivers admit stacked
devices as uppers. While the current switchdev code recursively descends
through layers of lower devices, eventually calling the op on a
front-panel port device, the notifier would reference a stacking device
that's one of front-panel ports uppers. The filtering is thus more
complex.

For ocelot, such iteration is currently pretty much required, because
there's no bookkeeping of LAG devices. mlxsw does keep the list of LAGs,
however it iterates the lower devices anyway when deciding whether an
event on a tunnel device pertains to the driver or not.

Therefore this patch set instead introduces, in patch #8, a helper to
iterate through lowers, much like the current switchdev code does,
looking for devices that match a given predicate.

Then in patches #9 and #10, first mlxsw and then ocelot are updated to
dispatch the newly-added notifier types to the preexisting
port_obj_add/_del handlers. The dispatch is done via the new helper, to
recursively descend through lower devices.

Finally in patch #11, the actual switch is made, retiring the current
SDO-based code in favor of a notifier.

Now that the event is distributed through a notifier, the explicit
netdevice check in rocker, DSA and ethsw doesn't let through any events
except those done on a front-panel port itself. It is therefore
unnecessary to check in VLAN-handling code whether a VLAN was added to
the bridge itself: such events will simply be ignored much sooner.
Therefore remove it in patch #12.

Petr Machata (12):
  switchdev: SWITCHDEV_OBJ_PORT_{VLAN, MDB}(): Sanitize
  switchdev: Add a blocking notifier chain
  switchdev: Add SWITCHDEV_PORT_OBJ_ADD, SWITCHDEV_PORT_OBJ_DEL
  rocker: Handle SWITCHDEV_PORT_OBJ_ADD/_DEL
  net: dsa: slave: Handle SWITCHDEV_PORT_OBJ_ADD/_DEL
  staging: fsl-dpaa2: ethsw: Introduce ethsw_port_dev_check()
  staging: fsl-dpaa2: ethsw: Handle SWITCHDEV_PORT_OBJ_ADD/_DEL
  switchdev: Add helpers to aid traversal through lower devices
  mlxsw: spectrum_switchdev: Handle SWITCHDEV_PORT_OBJ_ADD/_DEL
  ocelot: Handle SWITCHDEV_PORT_OBJ_ADD/_DEL
  switchdev: Replace port obj add/del SDO with a notification
  rocker, dsa, ethsw: Don't filter VLAN events on bridge itself

 .../ethernet/mellanox/mlxsw/spectrum_switchdev.c   |  47 -
 drivers/net/ethernet/mscc/ocelot.c |  30 +++-
 drivers/net/ethernet/mscc/ocelot.h |   1 +
 drivers/net/ethernet/mscc/ocelot_board.c   |   3 +
 drivers/net/ethernet/rocker/rocker_main.c  |  60 ++-
 drivers/staging/fsl-dpaa2/ethsw/ethsw.c|  68 

Re: [EXT] Re: [PATCH net-next 4/4] octeontx2-af: Bringup CGX LMAC links by default

2018-11-22 Thread Cherian, Linu
On Thu Nov 22, 2018 at 07:26:56PM +0100, Andrew Lunn wrote:
> On Thu, Nov 22, 2018 at 05:18:37PM +0530, Linu Cherian wrote:
> > From: Linu Cherian 
> >
> > - Added new CGX firmware interface API for sending link up/down
> >   commands
> >
> > - Do link up for cgx lmac ports by default at the time of CGX
> >   driver probe.
> 
> Hi Linu
> 
> This is a complex driver which i don't understand...
> 
> By link up, do you mean the equivalent of 'ip link set up dev ethX'?

Not really. It is used to do the necessary LMAC port hardware configuration
based on the connected PHYs and to bring up the PHY links.


> 
>Andrew

-- 
Linu cherian


[PATCH net-next v3 4/5] netns: enable to specify a nsid for a get request

2018-11-22 Thread Nicolas Dichtel
Combined with NETNSA_TARGET_NSID, this enables "translating" an nsid from
one netns to an nsid of another netns.
This is useful when using NETLINK_F_LISTEN_ALL_NSID because it helps the
user interpret an nsid received from another netns.

Signed-off-by: Nicolas Dichtel 
Reviewed-by: David Ahern 
---
 net/core/net_namespace.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 885c54197e31..dd25fb22ad45 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -797,6 +797,11 @@ static int rtnl_net_getid(struct sk_buff *skb, struct nlmsghdr *nlh,
} else if (tb[NETNSA_FD]) {
peer = get_net_ns_by_fd(nla_get_u32(tb[NETNSA_FD]));
nla = tb[NETNSA_FD];
+   } else if (tb[NETNSA_NSID]) {
+   peer = get_net_ns_by_id(net, nla_get_u32(tb[NETNSA_NSID]));
+   if (!peer)
+   peer = ERR_PTR(-ENOENT);
+   nla = tb[NETNSA_NSID];
} else {
NL_SET_ERR_MSG(extack, "Peer netns reference is missing");
return -EINVAL;
-- 
2.18.0



[PATCH net-next v3 1/5] netns: remove net arg from rtnl_net_fill()

2018-11-22 Thread Nicolas Dichtel
This argument is not used anymore.

Fixes: cab3c8ec8d57 ("netns: always provide the id to rtnl_net_fill()")
Signed-off-by: Nicolas Dichtel 
Reviewed-by: David Ahern 
---
 net/core/net_namespace.c | 10 --
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index fefe72774aeb..52b9620e3457 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -739,7 +739,7 @@ static int rtnl_net_get_size(void)
 }
 
 static int rtnl_net_fill(struct sk_buff *skb, u32 portid, u32 seq, int flags,
-int cmd, struct net *net, int nsid)
+int cmd, int nsid)
 {
struct nlmsghdr *nlh;
struct rtgenmsg *rth;
@@ -801,7 +801,7 @@ static int rtnl_net_getid(struct sk_buff *skb, struct nlmsghdr *nlh,
 
id = peernet2id(net, peer);
err = rtnl_net_fill(msg, NETLINK_CB(skb).portid, nlh->nlmsg_seq, 0,
-   RTM_NEWNSID, net, id);
+   RTM_NEWNSID, id);
if (err < 0)
goto err_out;
 
@@ -816,7 +816,6 @@ static int rtnl_net_getid(struct sk_buff *skb, struct nlmsghdr *nlh,
 }
 
 struct rtnl_net_dump_cb {
-   struct net *net;
struct sk_buff *skb;
struct netlink_callback *cb;
int idx;
@@ -833,7 +832,7 @@ static int rtnl_net_dumpid_one(int id, void *peer, void *data)
 
ret = rtnl_net_fill(net_cb->skb, NETLINK_CB(net_cb->cb->skb).portid,
net_cb->cb->nlh->nlmsg_seq, NLM_F_MULTI,
-   RTM_NEWNSID, net_cb->net, id);
+   RTM_NEWNSID, id);
if (ret < 0)
return ret;
 
@@ -846,7 +845,6 @@ static int rtnl_net_dumpid(struct sk_buff *skb, struct netlink_callback *cb)
 {
struct net *net = sock_net(skb->sk);
struct rtnl_net_dump_cb net_cb = {
-   .net = net,
.skb = skb,
.cb = cb,
.idx = 0,
@@ -876,7 +874,7 @@ static void rtnl_net_notifyid(struct net *net, int cmd, int id)
if (!msg)
goto out;
 
-   err = rtnl_net_fill(msg, 0, 0, 0, cmd, net, id);
+   err = rtnl_net_fill(msg, 0, 0, 0, cmd, id);
if (err < 0)
goto err_out;
 
-- 
2.18.0



[PATCH net-next v3 2/5] netns: introduce 'struct net_fill_args'

2018-11-22 Thread Nicolas Dichtel
This is preparatory work. To avoid passing too many arguments to
rtnl_net_fill(), a new structure is defined.

Signed-off-by: Nicolas Dichtel 
Reviewed-by: David Ahern 
---
 net/core/net_namespace.c | 48 
 1 file changed, 34 insertions(+), 14 deletions(-)

diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 52b9620e3457..f8a5966b086c 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -738,20 +738,28 @@ static int rtnl_net_get_size(void)
   ;
 }
 
-static int rtnl_net_fill(struct sk_buff *skb, u32 portid, u32 seq, int flags,
-int cmd, int nsid)
+struct net_fill_args {
+   u32 portid;
+   u32 seq;
+   int flags;
+   int cmd;
+   int nsid;
+};
+
+static int rtnl_net_fill(struct sk_buff *skb, struct net_fill_args *args)
 {
struct nlmsghdr *nlh;
struct rtgenmsg *rth;
 
-   nlh = nlmsg_put(skb, portid, seq, cmd, sizeof(*rth), flags);
+   nlh = nlmsg_put(skb, args->portid, args->seq, args->cmd, sizeof(*rth),
+   args->flags);
if (!nlh)
return -EMSGSIZE;
 
rth = nlmsg_data(nlh);
rth->rtgen_family = AF_UNSPEC;
 
-   if (nla_put_s32(skb, NETNSA_NSID, nsid))
+   if (nla_put_s32(skb, NETNSA_NSID, args->nsid))
goto nla_put_failure;
 
nlmsg_end(skb, nlh);
@@ -767,10 +775,15 @@ static int rtnl_net_getid(struct sk_buff *skb, struct nlmsghdr *nlh,
 {
struct net *net = sock_net(skb->sk);
struct nlattr *tb[NETNSA_MAX + 1];
+   struct net_fill_args fillargs = {
+   .portid = NETLINK_CB(skb).portid,
+   .seq = nlh->nlmsg_seq,
+   .cmd = RTM_NEWNSID,
+   };
struct nlattr *nla;
struct sk_buff *msg;
struct net *peer;
-   int err, id;
+   int err;
 
err = nlmsg_parse(nlh, sizeof(struct rtgenmsg), tb, NETNSA_MAX,
  rtnl_net_policy, extack);
@@ -799,9 +812,8 @@ static int rtnl_net_getid(struct sk_buff *skb, struct nlmsghdr *nlh,
goto out;
}
 
-   id = peernet2id(net, peer);
-   err = rtnl_net_fill(msg, NETLINK_CB(skb).portid, nlh->nlmsg_seq, 0,
-   RTM_NEWNSID, id);
+   fillargs.nsid = peernet2id(net, peer);
+   err = rtnl_net_fill(msg, &fillargs);
if (err < 0)
goto err_out;
 
@@ -817,7 +829,7 @@ static int rtnl_net_getid(struct sk_buff *skb, struct nlmsghdr *nlh,
 
 struct rtnl_net_dump_cb {
struct sk_buff *skb;
-   struct netlink_callback *cb;
+   struct net_fill_args fillargs;
int idx;
int s_idx;
 };
@@ -830,9 +842,8 @@ static int rtnl_net_dumpid_one(int id, void *peer, void *data)
if (net_cb->idx < net_cb->s_idx)
goto cont;
 
-   ret = rtnl_net_fill(net_cb->skb, NETLINK_CB(net_cb->cb->skb).portid,
-   net_cb->cb->nlh->nlmsg_seq, NLM_F_MULTI,
-   RTM_NEWNSID, id);
+   net_cb->fillargs.nsid = id;
+   ret = rtnl_net_fill(net_cb->skb, &net_cb->fillargs);
if (ret < 0)
return ret;
 
@@ -846,7 +857,12 @@ static int rtnl_net_dumpid(struct sk_buff *skb, struct netlink_callback *cb)
struct net *net = sock_net(skb->sk);
struct rtnl_net_dump_cb net_cb = {
.skb = skb,
-   .cb = cb,
+   .fillargs = {
+   .portid = NETLINK_CB(cb->skb).portid,
+   .seq = cb->nlh->nlmsg_seq,
+   .flags = NLM_F_MULTI,
+   .cmd = RTM_NEWNSID,
+   },
.idx = 0,
.s_idx = cb->args[0],
};
@@ -867,6 +883,10 @@ static int rtnl_net_dumpid(struct sk_buff *skb, struct netlink_callback *cb)
 
 static void rtnl_net_notifyid(struct net *net, int cmd, int id)
 {
+   struct net_fill_args fillargs = {
+   .cmd = cmd,
+   .nsid = id,
+   };
struct sk_buff *msg;
int err = -ENOMEM;
 
@@ -874,7 +894,7 @@ static void rtnl_net_notifyid(struct net *net, int cmd, int id)
if (!msg)
goto out;
 
-   err = rtnl_net_fill(msg, 0, 0, 0, cmd, id);
+   err = rtnl_net_fill(msg, &fillargs);
if (err < 0)
goto err_out;
 
-- 
2.18.0



[PATCH net-next v3 5/5] netns: enable to dump full nsid translation table

2018-11-22 Thread Nicolas Dichtel
Like the previous patch, the goal is to ease converting nsids from one
netns to another.
A new attribute (NETNSA_CURRENT_NSID) is added to the kernel answer when
NETNSA_TARGET_NSID is provided, so that the user can easily convert nsids.

Signed-off-by: Nicolas Dichtel 
---
 include/uapi/linux/net_namespace.h |  1 +
 net/core/net_namespace.c   | 31 --
 2 files changed, 26 insertions(+), 6 deletions(-)

diff --git a/include/uapi/linux/net_namespace.h b/include/uapi/linux/net_namespace.h
index 0ed9dd61d32a..9f9956809565 100644
--- a/include/uapi/linux/net_namespace.h
+++ b/include/uapi/linux/net_namespace.h
@@ -17,6 +17,7 @@ enum {
NETNSA_PID,
NETNSA_FD,
NETNSA_TARGET_NSID,
+   NETNSA_CURRENT_NSID,
__NETNSA_MAX,
 };
 
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index dd25fb22ad45..2f25d7f2a43b 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -736,6 +736,7 @@ static int rtnl_net_get_size(void)
 {
return NLMSG_ALIGN(sizeof(struct rtgenmsg))
   + nla_total_size(sizeof(s32)) /* NETNSA_NSID */
+  + nla_total_size(sizeof(s32)) /* NETNSA_CURRENT_NSID */
   ;
 }
 
@@ -745,6 +746,8 @@ struct net_fill_args {
int flags;
int cmd;
int nsid;
+   bool add_ref;
+   int ref_nsid;
 };
 
 static int rtnl_net_fill(struct sk_buff *skb, struct net_fill_args *args)
@@ -763,6 +766,10 @@ static int rtnl_net_fill(struct sk_buff *skb, struct 
net_fill_args *args)
if (nla_put_s32(skb, NETNSA_NSID, args->nsid))
goto nla_put_failure;
 
+   if (args->add_ref &&
+   nla_put_s32(skb, NETNSA_CURRENT_NSID, args->ref_nsid))
+   goto nla_put_failure;
+
nlmsg_end(skb, nlh);
return 0;
 
@@ -782,7 +789,6 @@ static int rtnl_net_getid(struct sk_buff *skb, struct 
nlmsghdr *nlh,
.cmd = RTM_NEWNSID,
};
struct net *peer, *target = net;
-   bool put_target = false;
struct nlattr *nla;
struct sk_buff *msg;
int err;
@@ -824,7 +830,8 @@ static int rtnl_net_getid(struct sk_buff *skb, struct 
nlmsghdr *nlh,
err = PTR_ERR(target);
goto out;
}
-   put_target = true;
+   fillargs.add_ref = true;
+   fillargs.ref_nsid = peernet2id(net, peer);
}
 
msg = nlmsg_new(rtnl_net_get_size(), GFP_KERNEL);
@@ -844,7 +851,7 @@ static int rtnl_net_getid(struct sk_buff *skb, struct 
nlmsghdr *nlh,
 err_out:
nlmsg_free(msg);
 out:
-   if (put_target)
+   if (fillargs.add_ref)
put_net(target);
put_net(peer);
return err;
@@ -852,11 +859,11 @@ static int rtnl_net_getid(struct sk_buff *skb, struct 
nlmsghdr *nlh,
 
 struct rtnl_net_dump_cb {
struct net *tgt_net;
+   struct net *ref_net;
struct sk_buff *skb;
struct net_fill_args fillargs;
int idx;
int s_idx;
-   bool put_tgt_net;
 };
 
 static int rtnl_net_dumpid_one(int id, void *peer, void *data)
@@ -868,6 +875,8 @@ static int rtnl_net_dumpid_one(int id, void *peer, void 
*data)
goto cont;
 
net_cb->fillargs.nsid = id;
+   if (net_cb->fillargs.add_ref)
+   net_cb->fillargs.ref_nsid = __peernet2id(net_cb->ref_net, peer);
ret = rtnl_net_fill(net_cb->skb, &net_cb->fillargs);
if (ret < 0)
return ret;
@@ -904,8 +913,9 @@ static int rtnl_valid_dump_net_req(const struct nlmsghdr 
*nlh, struct sock *sk,
   "Invalid target network namespace id");
return PTR_ERR(net);
}
+   net_cb->fillargs.add_ref = true;
+   net_cb->ref_net = net_cb->tgt_net;
net_cb->tgt_net = net;
-   net_cb->put_tgt_net = true;
} else {
NL_SET_BAD_ATTR(extack, tb[i]);
NL_SET_ERR_MSG(extack,
@@ -940,12 +950,21 @@ static int rtnl_net_dumpid(struct sk_buff *skb, struct 
netlink_callback *cb)
}
 
spin_lock_bh(&net_cb.tgt_net->nsid_lock);
+   if (net_cb.fillargs.add_ref &&
+   !net_eq(net_cb.ref_net, net_cb.tgt_net) &&
+   !spin_trylock_bh(&net_cb.ref_net->nsid_lock)) {
+   err = -EAGAIN;
+   goto end;
+   }
idr_for_each(&net_cb.tgt_net->netns_ids, rtnl_net_dumpid_one, &net_cb);
+   if (net_cb.fillargs.add_ref &&
+   !net_eq(net_cb.ref_net, net_cb.tgt_net))
+   spin_unlock_bh(&net_cb.ref_net->nsid_lock);
spin_unlock_bh(&net_cb.tgt_net->nsid_lock);
 
cb->args[0] = net_cb.idx;
 end:
-   if (net_cb.put_tgt_net)
+   if (net_cb.fillargs.add_ref)
put_net(net_cb.tgt_net);
return err < 0 ? err : skb->len;
 }
-- 
2.18.0



[PATCH net-next v3 0/5] Ease to interpret net-nsid

2018-11-22 Thread Nicolas Dichtel


The goal of this series is to ease the interpretation of nsids received in
netlink messages from other netns (when the user uses
NETLINK_F_LISTEN_ALL_NSID).

After this series, with a patched iproute2:

$ ip netns add foo
$ ip netns add bar
$ touch /var/run/netns/init_net
$ mount --bind /proc/1/ns/net /var/run/netns/init_net
$ ip netns set init_net 11
$ ip netns set foo 12
$ ip netns set bar 13
$ ip netns
init_net (id: 11)
bar (id: 13)
foo (id: 12)
$ ip -n foo netns set init_net 21
$ ip -n foo netns set foo 22
$ ip -n foo netns set bar 23
$ ip -n foo netns
init_net (id: 21)
bar (id: 23)
foo (id: 22)
$ ip -n bar netns set init_net 31
$ ip -n bar netns set foo 32
$ ip -n bar netns set bar 33
$ ip -n bar netns
init_net (id: 31)
bar (id: 33)
foo (id: 32)
$ ip netns list-id target-nsid 12
nsid 21 current-nsid 11 (iproute2 netns name: init_net)
nsid 22 current-nsid 12 (iproute2 netns name: foo)
nsid 23 current-nsid 13 (iproute2 netns name: bar)
$ ip -n bar netns list-id target-nsid 32 nsid 31
nsid 21 current-nsid 31 (iproute2 netns name: init_net)

v2 -> v3:
  - patch 5/5: account NETNSA_CURRENT_NSID in rtnl_net_get_size()

v1 -> v2:
  - patch 1/5: remove net from struct rtnl_net_dump_cb
  - patch 2/5: new in this version
  - patch 3/5: use a bool to know if rtnl_get_net_ns_capable() was called
  - patch 5/5: use struct net_fill_args

 include/uapi/linux/net_namespace.h |   2 +
 net/core/net_namespace.c   | 158 +++--
 2 files changed, 134 insertions(+), 26 deletions(-)

Comments are welcomed,
Regards,
Nicolas



[PATCH net-next v3 3/5] netns: add support of NETNSA_TARGET_NSID

2018-11-22 Thread Nicolas Dichtel
As was already done for link and address, add the ability to perform
get/dump in another netns by specifying a target nsid attribute.

Signed-off-by: Nicolas Dichtel 
Reviewed-by: David Ahern 
---
 include/uapi/linux/net_namespace.h |  1 +
 net/core/net_namespace.c   | 86 ++
 2 files changed, 76 insertions(+), 11 deletions(-)

diff --git a/include/uapi/linux/net_namespace.h 
b/include/uapi/linux/net_namespace.h
index 0187c74d8889..0ed9dd61d32a 100644
--- a/include/uapi/linux/net_namespace.h
+++ b/include/uapi/linux/net_namespace.h
@@ -16,6 +16,7 @@ enum {
NETNSA_NSID,
NETNSA_PID,
NETNSA_FD,
+   NETNSA_TARGET_NSID,
__NETNSA_MAX,
 };
 
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index f8a5966b086c..885c54197e31 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -669,6 +669,7 @@ static const struct nla_policy rtnl_net_policy[NETNSA_MAX + 
1] = {
[NETNSA_NSID]   = { .type = NLA_S32 },
[NETNSA_PID]= { .type = NLA_U32 },
[NETNSA_FD] = { .type = NLA_U32 },
+   [NETNSA_TARGET_NSID]= { .type = NLA_S32 },
 };
 
 static int rtnl_net_newid(struct sk_buff *skb, struct nlmsghdr *nlh,
@@ -780,9 +781,10 @@ static int rtnl_net_getid(struct sk_buff *skb, struct 
nlmsghdr *nlh,
.seq = nlh->nlmsg_seq,
.cmd = RTM_NEWNSID,
};
+   struct net *peer, *target = net;
+   bool put_target = false;
struct nlattr *nla;
struct sk_buff *msg;
-   struct net *peer;
int err;
 
err = nlmsg_parse(nlh, sizeof(struct rtgenmsg), tb, NETNSA_MAX,
@@ -806,13 +808,27 @@ static int rtnl_net_getid(struct sk_buff *skb, struct 
nlmsghdr *nlh,
return PTR_ERR(peer);
}
 
+   if (tb[NETNSA_TARGET_NSID]) {
+   int id = nla_get_s32(tb[NETNSA_TARGET_NSID]);
+
+   target = rtnl_get_net_ns_capable(NETLINK_CB(skb).sk, id);
+   if (IS_ERR(target)) {
+   NL_SET_BAD_ATTR(extack, tb[NETNSA_TARGET_NSID]);
+   NL_SET_ERR_MSG(extack,
+  "Target netns reference is invalid");
+   err = PTR_ERR(target);
+   goto out;
+   }
+   put_target = true;
+   }
+
msg = nlmsg_new(rtnl_net_get_size(), GFP_KERNEL);
if (!msg) {
err = -ENOMEM;
goto out;
}
 
-   fillargs.nsid = peernet2id(net, peer);
+   fillargs.nsid = peernet2id(target, peer);
err = rtnl_net_fill(msg, &fillargs);
if (err < 0)
goto err_out;
@@ -823,15 +839,19 @@ static int rtnl_net_getid(struct sk_buff *skb, struct 
nlmsghdr *nlh,
 err_out:
nlmsg_free(msg);
 out:
+   if (put_target)
+   put_net(target);
put_net(peer);
return err;
 }
 
 struct rtnl_net_dump_cb {
+   struct net *tgt_net;
struct sk_buff *skb;
struct net_fill_args fillargs;
int idx;
int s_idx;
+   bool put_tgt_net;
 };
 
 static int rtnl_net_dumpid_one(int id, void *peer, void *data)
@@ -852,10 +872,50 @@ static int rtnl_net_dumpid_one(int id, void *peer, void 
*data)
return 0;
 }
 
+static int rtnl_valid_dump_net_req(const struct nlmsghdr *nlh, struct sock *sk,
+  struct rtnl_net_dump_cb *net_cb,
+  struct netlink_callback *cb)
+{
+   struct netlink_ext_ack *extack = cb->extack;
+   struct nlattr *tb[NETNSA_MAX + 1];
+   int err, i;
+
+   err = nlmsg_parse_strict(nlh, sizeof(struct rtgenmsg), tb, NETNSA_MAX,
+rtnl_net_policy, extack);
+   if (err < 0)
+   return err;
+
+   for (i = 0; i <= NETNSA_MAX; i++) {
+   if (!tb[i])
+   continue;
+
+   if (i == NETNSA_TARGET_NSID) {
+   struct net *net;
+
+   net = rtnl_get_net_ns_capable(sk, nla_get_s32(tb[i]));
+   if (IS_ERR(net)) {
+   NL_SET_BAD_ATTR(extack, tb[i]);
+   NL_SET_ERR_MSG(extack,
+  "Invalid target network namespace id");
+   return PTR_ERR(net);
+   }
+   net_cb->tgt_net = net;
+   net_cb->put_tgt_net = true;
+   } else {
+   NL_SET_BAD_ATTR(extack, tb[i]);
+   NL_SET_ERR_MSG(extack,
+  "Unsupported attribute in dump request");
+   return -EINVAL;
+   }
+   }
+
+   return 0;
+}
+
 static int rtnl_net_dumpid(struct sk_buff *skb, struct netlink_callback *cb)
 {
-   struct net *net = sock_net(skb->sk);
struct 

[PATCH v2] samples: bpf: fix: error handling regarding kprobe_events

2018-11-22 Thread Daniel T. Lee
Currently, kprobe_events failures are not handled properly.
Because the sample writes to kprobe_events indirectly via system(),
it cannot be determined whether an error comes from kprobe or from system().

// buf = "echo '%c:%s %s' >> /s/k/d/t/kprobe_events"
err = system(buf);
if (err < 0) {
printf("failed to create kprobe ..");
return -1;
}

For example, when running the ./tracex7 sample on an ext4 partition,
"echo p:open_ctree open_ctree >> /s/k/d/t/kprobe_events"
fails and system() returns error code 256.
=> The error comes from kprobe, but it's not handled correctly.

According to the system(3) man page, its return value
just passes along the termination status of the child shell
rather than reporting the error as -1. (success doesn't matter here)

Which means the code snippet above does not work as desired.

ex) running ./tracex7 with ext4 env.
# Current Output
sh: echo: I/O error
failed to open event open_ctree

# Desired Output
failed to create kprobe 'open_ctree' error 'No such file or directory'

The problem is that the error cannot be attributed to either the child
process or system() itself.

But using write() directly makes the command failure verifiable,
and it reports every error as -1.

So I suggest writing directly to 'kprobe_events' with write()
rather than calling system().

Signed-off-by: Daniel T. Lee 
---
Changes in v2:
  - Fix code style at variable declaration.

 samples/bpf/bpf_load.c | 33 -
 1 file changed, 24 insertions(+), 9 deletions(-)

diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index e6d7e0fe155b..96783207de4a 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -54,6 +54,23 @@ static int populate_prog_array(const char *event, int 
prog_fd)
return 0;
 }
 
+static int write_kprobe_events(const char *val)
+{
+   int fd, ret, flags;
+
+   if ((val != NULL) && (val[0] == '\0'))
+   flags = O_WRONLY | O_TRUNC;
+   else
+   flags = O_WRONLY | O_APPEND;
+
+   fd = open("/sys/kernel/debug/tracing/kprobe_events", flags);
+
+   ret = write(fd, val, strlen(val));
+   close(fd);
+
+   return ret;
+}
+
 static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 {
bool is_socket = strncmp(event, "socket", 6) == 0;
@@ -165,10 +182,9 @@ static int load_and_attach(const char *event, struct 
bpf_insn *prog, int size)
 
 #ifdef __x86_64__
if (strncmp(event, "sys_", 4) == 0) {
-   snprintf(buf, sizeof(buf),
-"echo '%c:__x64_%s __x64_%s' >> /sys/kernel/debug/tracing/kprobe_events",
-is_kprobe ? 'p' : 'r', event, event);
-   err = system(buf);
+   snprintf(buf, sizeof(buf), "%c:__x64_%s __x64_%s",
+   is_kprobe ? 'p' : 'r', event, event);
+   err = write_kprobe_events(buf);
if (err >= 0) {
need_normal_check = false;
event_prefix = "__x64_";
@@ -176,10 +192,9 @@ static int load_and_attach(const char *event, struct 
bpf_insn *prog, int size)
}
 #endif
if (need_normal_check) {
-   snprintf(buf, sizeof(buf),
-"echo '%c:%s %s' >> /sys/kernel/debug/tracing/kprobe_events",
-is_kprobe ? 'p' : 'r', event, event);
-   err = system(buf);
+   snprintf(buf, sizeof(buf), "%c:%s %s",
+   is_kprobe ? 'p' : 'r', event, event);
+   err = write_kprobe_events(buf);
if (err < 0) {
printf("failed to create kprobe '%s' error '%s'\n",
   event, strerror(errno));
@@ -519,7 +534,7 @@ static int do_load_bpf_file(const char *path, fixup_map_cb 
fixup_map)
return 1;
 
/* clear all kprobes */
-   i = system("echo \"\" > /sys/kernel/debug/tracing/kprobe_events");
+   i = write_kprobe_events("");
 
/* scan over all elf sections to get license and map info */
for (i = 1; i < ehdr.e_shnum; i++) {
-- 
2.17.1



Re: [RFC PATCH bpf-next] libbpf: make bpf_object__open default to UNSPEC

2018-11-22 Thread Daniel Borkmann
[ +Wang ]

On 11/22/2018 07:03 AM, Nikita V. Shirokov wrote:
> currently by default libbpf's bpf_object__open requires
> bpf's program to specify  version in a code because of two things:
> 1) default prog type is set to KPROBE
> 2) KPROBE requires (in kernel/bpf/syscall.c) version to be specified
> 
> in this RFC i'm proposing change default to UNSPEC and also changing
> logic of libbpf that it would reflect what we have today in kernel
> (aka only KPROBE type requires for version to be explicitly set).
> 
> reason for change:
> currently only libbpf requires the version to be explicitly set by
> default. it would be really hard for maintainers of other custom
> bpf loaders to migrate to libbpf (as they don't control users' code
> and migration to the new loader (libbpf) won't be transparent for the
> end user).
> 
> what is going to be broken after this change:
> if someone was relying on the default being KPROBE for bpf_object__open,
> their code will stop working. however i'm really doubtful that anyone
> is using this for kprobe-type programs (instead of, say, bcc or
> other tracing frameworks)
> 
> other possible solutions (for discussion, would require more machinery):
> add another function like bpf_object__open w/ default to unspec
> 
> Signed-off-by: Nikita V. Shirokov 
> ---
>  tools/lib/bpf/libbpf.c | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index 0f14f7c074c2..ed4212a4c5f9 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -333,7 +333,7 @@ bpf_program__init(void *data, size_t size, char 
> *section_name, int idx,
>   prog->idx = idx;
>   prog->instances.fds = NULL;
>   prog->instances.nr = -1;
> - prog->type = BPF_PROG_TYPE_KPROBE;
> + prog->type = BPF_PROG_TYPE_UNSPEC;
>   prog->btf_fd = -1;

Seems this was mostly for historic reasons, but for a generic library this
would indeed be an odd convention for a default. Wang, given 5f44e4c810bf
("tools lib bpf: New API to adjust type of a BPF program"), are you in any
way relying on this default, or are you using things like
bpf_program__set_kprobe() instead, which you added there? If the latter,
I'd say we had better change it now rather than later, when there's even
more lib usage (and in particular before we add official ABI versioning).

>   return 0;
> @@ -1649,12 +1649,12 @@ static bool bpf_prog_type__needs_kver(enum 
> bpf_prog_type type)
>   case BPF_PROG_TYPE_LIRC_MODE2:
>   case BPF_PROG_TYPE_SK_REUSEPORT:
>   case BPF_PROG_TYPE_FLOW_DISSECTOR:
> - return false;
>   case BPF_PROG_TYPE_UNSPEC:
> - case BPF_PROG_TYPE_KPROBE:
>   case BPF_PROG_TYPE_TRACEPOINT:
> - case BPF_PROG_TYPE_PERF_EVENT:
>   case BPF_PROG_TYPE_RAW_TRACEPOINT:
> + case BPF_PROG_TYPE_PERF_EVENT:
> + return false;
> + case BPF_PROG_TYPE_KPROBE:
>   default:
>   return true;
>   }
> 

Thanks,
Daniel


Re: [PATCH] bpf: fix check of allowed specifiers in bpf_trace_printk

2018-11-22 Thread Daniel Borkmann
Hi Martynas,

On 11/22/2018 05:00 PM, Martynas Pumputis wrote:
> A format string consisting of "%p" or "%s" followed by an invalid
> specifier (e.g. "%p%\n" or "%s%") could pass the check, which
> would make format_decode() (lib/vsprintf.c) warn.
> 
> Reported-by: syzbot+1ec5c5ec949c4adaa...@syzkaller.appspotmail.com
> Signed-off-by: Martynas Pumputis 
> ---
>  kernel/trace/bpf_trace.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index 08fcfe440c63..9ab05736e1a1 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -225,6 +225,8 @@ BPF_CALL_5(bpf_trace_printk, char *, fmt, u32, fmt_size, 
> u64, arg1,
>   (void *) (long) unsafe_addr,
>   sizeof(buf));
>   }
> + if (fmt[i] == '%')
> + i--;
>   continue;
>   }

Thanks for the fix! Could we simplify the logic a bit to avoid having to
navigate i back and forth which got us in trouble in the first place? Like
below (untested) perhaps?

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 08fcfe4..ff83b8c 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -196,11 +196,13 @@ BPF_CALL_5(bpf_trace_printk, char *, fmt, u32, fmt_size, 
u64, arg1,
i++;
} else if (fmt[i] == 'p' || fmt[i] == 's') {
mod[fmt_cnt]++;
-   i++;
-   if (!isspace(fmt[i]) && !ispunct(fmt[i]) && fmt[i] != 0)
+   /* Disallow any further format extensions. */
+   if (fmt[i + 1] != 0 &&
+   !isspace(fmt[i + 1]) &&
+   !ispunct(fmt[i + 1]))
return -EINVAL;
fmt_cnt++;
-   if (fmt[i - 1] == 's') {
+   if (fmt[i] == 's') {
if (str_seen)
/* allow only one '%s' per fmt string */
return -EINVAL;

Thanks,
Daniel


Re: [PATCH net-next,v3 00/12] add flow_rule infrastructure

2018-11-22 Thread Marcelo Ricardo Leitner
On Thu, Nov 22, 2018 at 02:22:20PM -0200, Marcelo Ricardo Leitner wrote:
> On Wed, Nov 21, 2018 at 03:51:20AM +0100, Pablo Neira Ayuso wrote:
> > Hi,
> > 
> > This patchset is the third iteration [1] [2] [3] to introduce a kernel
> > intermediate (IR) to express ACL hardware offloads.
> 
> On v2 cover letter you had:
> 
> """
> However, cost of this layer is very small, adding 1 million rules via
> tc -batch, perf shows:
> 
>  0.06%  tc   [kernel.vmlinux][k] tc_setup_flow_action
> """
> 
> The above doesn't include time spent on children calls and I'm worried
> about the new allocation done by flow_rule_alloc(), as it can impact
> rule insertion rate. I'll run some tests here and report back.

I'm seeing +60ms on 1.75s (~3.4%) to add 40k flower rules on ingress
with skip_hw and tc in batch mode, with flows like:

filter add dev p6p2 parent : protocol ip prio 1 flower skip_hw
src_mac ec:13:db:00:00:00 dst_mac ec:14:c2:00:00:00 src_ip
56.0.0.0 dst_ip 55.0.0.0 action drop

Only 20ms out of those 60ms were consumed within fl_change() calls
(considering children calls), though.

Do you see something similar?  I used current net-next (d59da3fbfe3f)
and with this patchset applied.



[PATCH net-next 3/5] r8169: simplify detecting chip versions with same XID

2018-11-22 Thread Heiner Kallweit
For the GMII chip versions we set the version number that was
already set. This can be simplified.

Signed-off-by: Heiner Kallweit 
---
 drivers/net/ethernet/realtek/r8169.c | 19 +++
 1 file changed, 7 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c 
b/drivers/net/ethernet/realtek/r8169.c
index 1e549b26b..9a696455e 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -2116,18 +2116,13 @@ static void rtl8169_get_mac_version(struct 
rtl8169_private *tp)
 
if (tp->mac_version == RTL_GIGA_MAC_NONE) {
dev_err(tp_to_dev(tp), "unknown chip XID %03x\n", reg & 0xfcf);
-   } else if (tp->mac_version == RTL_GIGA_MAC_VER_42) {
-   tp->mac_version = tp->supports_gmii ?
- RTL_GIGA_MAC_VER_42 :
- RTL_GIGA_MAC_VER_43;
-   } else if (tp->mac_version == RTL_GIGA_MAC_VER_45) {
-   tp->mac_version = tp->supports_gmii ?
- RTL_GIGA_MAC_VER_45 :
- RTL_GIGA_MAC_VER_47;
-   } else if (tp->mac_version == RTL_GIGA_MAC_VER_46) {
-   tp->mac_version = tp->supports_gmii ?
- RTL_GIGA_MAC_VER_46 :
- RTL_GIGA_MAC_VER_48;
+   } else if (!tp->supports_gmii) {
+   if (tp->mac_version == RTL_GIGA_MAC_VER_42)
+   tp->mac_version = RTL_GIGA_MAC_VER_43;
+   else if (tp->mac_version == RTL_GIGA_MAC_VER_45)
+   tp->mac_version = RTL_GIGA_MAC_VER_47;
+   else if (tp->mac_version == RTL_GIGA_MAC_VER_46)
+   tp->mac_version = RTL_GIGA_MAC_VER_48;
}
 }
 
-- 
2.19.1




[PATCH net-next 4/5] r8169: use napi_consume_skb where possible

2018-11-22 Thread Heiner Kallweit
Use napi_consume_skb() where possible to benefit from the
bulk free infrastructure.

Signed-off-by: Heiner Kallweit 
---
 drivers/net/ethernet/realtek/r8169.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c 
b/drivers/net/ethernet/realtek/r8169.c
index 9a696455e..2ca0b2ed9 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -6204,7 +6204,8 @@ static void rtl8169_pcierr_interrupt(struct net_device 
*dev)
rtl_schedule_task(tp, RTL_FLAG_TASK_RESET_PENDING);
 }
 
-static void rtl_tx(struct net_device *dev, struct rtl8169_private *tp)
+static void rtl_tx(struct net_device *dev, struct rtl8169_private *tp,
+  int budget)
 {
unsigned int dirty_tx, tx_left, bytes_compl = 0, pkts_compl = 0;
 
@@ -6232,7 +6233,7 @@ static void rtl_tx(struct net_device *dev, struct 
rtl8169_private *tp)
if (status & LastFrag) {
pkts_compl++;
bytes_compl += tx_skb->skb->len;
-   dev_consume_skb_any(tx_skb->skb);
+   napi_consume_skb(tx_skb->skb, budget);
tx_skb->skb = NULL;
}
dirty_tx++;
@@ -6475,7 +6476,7 @@ static int rtl8169_poll(struct napi_struct *napi, int 
budget)
 
work_done = rtl_rx(dev, tp, (u32) budget);
 
-   rtl_tx(dev, tp);
+   rtl_tx(dev, tp, budget);
 
if (work_done < budget) {
napi_complete_done(napi, work_done);
-- 
2.19.1




[PATCH net-next 5/5] r8169: replace macro TX_FRAGS_READY_FOR with a function

2018-11-22 Thread Heiner Kallweit
Replace the macro TX_FRAGS_READY_FOR with the function rtl_tx_slots_avail
to make the code cleaner and type-safe.

Signed-off-by: Heiner Kallweit 
---
 drivers/net/ethernet/realtek/r8169.c | 24 +---
 1 file changed, 13 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c 
b/drivers/net/ethernet/realtek/r8169.c
index 2ca0b2ed9..f768b966e 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -56,13 +56,6 @@
 #define R8169_MSG_DEFAULT \
(NETIF_MSG_DRV | NETIF_MSG_PROBE | NETIF_MSG_IFUP | NETIF_MSG_IFDOWN)
 
-#define TX_SLOTS_AVAIL(tp) \
-   (tp->dirty_tx + NUM_TX_DESC - tp->cur_tx)
-
-/* A skbuff with nr_frags needs nr_frags+1 entries in the tx queue */
-#define TX_FRAGS_READY_FOR(tp,nr_frags) \
-   (TX_SLOTS_AVAIL(tp) >= (nr_frags + 1))
-
 /* Maximum number of multicast addresses to filter (vs. Rx-all-multicast).
The RTL chips use a 64 element hash table based on the Ethernet CRC. */
 static const int multicast_filter_limit = 32;
@@ -6058,6 +6051,15 @@ static bool rtl8169_tso_csum_v2(struct rtl8169_private 
*tp,
return true;
 }
 
+static bool rtl_tx_slots_avail(struct rtl8169_private *tp,
+  unsigned int nr_frags)
+{
+   unsigned int slots_avail = tp->dirty_tx + NUM_TX_DESC - tp->cur_tx;
+
+   /* A skbuff with nr_frags needs nr_frags+1 entries in the tx queue */
+   return slots_avail > nr_frags;
+}
+
 static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
  struct net_device *dev)
 {
@@ -6069,7 +6071,7 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
u32 opts[2], len;
int frags;
 
-   if (unlikely(!TX_FRAGS_READY_FOR(tp, skb_shinfo(skb)->nr_frags))) {
+   if (unlikely(!rtl_tx_slots_avail(tp, skb_shinfo(skb)->nr_frags))) {
netif_err(tp, drv, dev, "BUG! Tx Ring full when queue awake!\n");
goto err_stop_0;
}
@@ -6126,7 +6128,7 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
 
mmiowb();
 
-   if (!TX_FRAGS_READY_FOR(tp, MAX_SKB_FRAGS)) {
+   if (!rtl_tx_slots_avail(tp, MAX_SKB_FRAGS)) {
/* Avoid wrongly optimistic queue wake-up: rtl_tx thread must
 * not miss a ring update when it notices a stopped queue.
 */
@@ -6140,7 +6142,7 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
 * can't.
 */
smp_mb();
-   if (TX_FRAGS_READY_FOR(tp, MAX_SKB_FRAGS))
+   if (rtl_tx_slots_avail(tp, MAX_SKB_FRAGS))
netif_wake_queue(dev);
}
 
@@ -6258,7 +6260,7 @@ static void rtl_tx(struct net_device *dev, struct 
rtl8169_private *tp,
 */
smp_mb();
if (netif_queue_stopped(dev) &&
-   TX_FRAGS_READY_FOR(tp, MAX_SKB_FRAGS)) {
+   rtl_tx_slots_avail(tp, MAX_SKB_FRAGS)) {
netif_wake_queue(dev);
}
/*
-- 
2.19.1




[PATCH net-next 2/5] r8169: remove default chip versions

2018-11-22 Thread Heiner Kallweit
Even the chip versions within a family have so many differences that
using a default chip version doesn't really make sense. Instead of
leaving the user with at best flaky network connectivity, bail out and
report the unknown chip version.

Signed-off-by: Heiner Kallweit 
---
 drivers/net/ethernet/realtek/r8169.c | 15 +--
 1 file changed, 5 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c 
b/drivers/net/ethernet/realtek/r8169.c
index bef89ba50..1e549b26b 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -2011,8 +2011,7 @@ static const struct ethtool_ops rtl8169_ethtool_ops = {
.set_link_ksettings = phy_ethtool_set_link_ksettings,
 };
 
-static void rtl8169_get_mac_version(struct rtl8169_private *tp,
-   u8 default_version)
+static void rtl8169_get_mac_version(struct rtl8169_private *tp)
 {
/*
 * The driver currently handles the 8168Bf and the 8168Be identically
@@ -2116,9 +2115,7 @@ static void rtl8169_get_mac_version(struct 
rtl8169_private *tp,
tp->mac_version = p->mac_version;
 
if (tp->mac_version == RTL_GIGA_MAC_NONE) {
-   dev_notice(tp_to_dev(tp),
-  "unknown MAC, using family default\n");
-   tp->mac_version = default_version;
+   dev_err(tp_to_dev(tp), "unknown chip XID %03x\n", reg & 0xfcf);
} else if (tp->mac_version == RTL_GIGA_MAC_VER_42) {
tp->mac_version = tp->supports_gmii ?
  RTL_GIGA_MAC_VER_42 :
@@ -6976,27 +6973,23 @@ static const struct rtl_cfg_info {
u16 irq_mask;
unsigned int has_gmii:1;
const struct rtl_coalesce_info *coalesce_info;
-   u8 default_ver;
 } rtl_cfg_infos [] = {
[RTL_CFG_0] = {
.hw_start   = rtl_hw_start_8169,
.irq_mask   = SYSErr | LinkChg | RxOverflow | RxFIFOOver,
.has_gmii   = 1,
.coalesce_info  = rtl_coalesce_info_8169,
-   .default_ver= RTL_GIGA_MAC_VER_01,
},
[RTL_CFG_1] = {
.hw_start   = rtl_hw_start_8168,
.irq_mask   = LinkChg | RxOverflow,
.has_gmii   = 1,
.coalesce_info  = rtl_coalesce_info_8168_8136,
-   .default_ver= RTL_GIGA_MAC_VER_11,
},
[RTL_CFG_2] = {
.hw_start   = rtl_hw_start_8101,
.irq_mask   = LinkChg | RxOverflow | RxFIFOOver,
.coalesce_info  = rtl_coalesce_info_8168_8136,
-   .default_ver= RTL_GIGA_MAC_VER_13,
}
 };
 
@@ -7259,7 +7252,9 @@ static int rtl_init_one(struct pci_dev *pdev, const 
struct pci_device_id *ent)
tp->mmio_addr = pcim_iomap_table(pdev)[region];
 
/* Identify chip attached to board */
-   rtl8169_get_mac_version(tp, cfg->default_ver);
+   rtl8169_get_mac_version(tp);
+   if (tp->mac_version == RTL_GIGA_MAC_NONE)
+   return -ENODEV;
 
if (rtl_tbi_enabled(tp)) {
dev_err(&pdev->dev, "TBI fiber mode not supported\n");
-- 
2.19.1




[PATCH net-next 1/5] r8169: remove ancient GCC bug workaround in a second place

2018-11-22 Thread Heiner Kallweit
Remove the ancient GCC bug workaround in a second place and factor out
rtl8169_get_txd_opts1.

Signed-off-by: Heiner Kallweit 
---
 drivers/net/ethernet/realtek/r8169.c | 25 ++---
 1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c 
b/drivers/net/ethernet/realtek/r8169.c
index f5781285a..bef89ba50 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -5840,6 +5840,16 @@ static void rtl8169_tx_timeout(struct net_device *dev)
rtl_schedule_task(tp, RTL_FLAG_TASK_RESET_PENDING);
 }
 
+static __le32 rtl8169_get_txd_opts1(u32 opts0, u32 len, unsigned int entry)
+{
+   u32 status = opts0 | len;
+
+   if (entry == NUM_TX_DESC - 1)
+   status |= RingEnd;
+
+   return cpu_to_le32(status);
+}
+
 static int rtl8169_xmit_frags(struct rtl8169_private *tp, struct sk_buff *skb,
  u32 *opts)
 {
@@ -5852,7 +5862,7 @@ static int rtl8169_xmit_frags(struct rtl8169_private *tp, 
struct sk_buff *skb,
for (cur_frag = 0; cur_frag < info->nr_frags; cur_frag++) {
const skb_frag_t *frag = info->frags + cur_frag;
dma_addr_t mapping;
-   u32 status, len;
+   u32 len;
void *addr;
 
entry = (entry + 1) % NUM_TX_DESC;
@@ -5868,11 +5878,7 @@ static int rtl8169_xmit_frags(struct rtl8169_private 
*tp, struct sk_buff *skb,
goto err_out;
}
 
-   status = opts[0] | len;
-   if (entry == NUM_TX_DESC - 1)
-   status |= RingEnd;
-
-   txd->opts1 = cpu_to_le32(status);
+   txd->opts1 = rtl8169_get_txd_opts1(opts[0], len, entry);
txd->opts2 = cpu_to_le32(opts[1]);
txd->addr = cpu_to_le64(mapping);
 
@@ -6068,8 +6074,7 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
struct TxDesc *txd = tp->TxDescArray + entry;
struct device *d = tp_to_dev(tp);
dma_addr_t mapping;
-   u32 status, len;
-   u32 opts[2];
+   u32 opts[2], len;
int frags;
 
if (unlikely(!TX_FRAGS_READY_FOR(tp, skb_shinfo(skb)->nr_frags))) {
@@ -6118,9 +6123,7 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
/* Force memory writes to complete before releasing descriptor */
dma_wmb();
 
-   /* Anti gcc 2.95.3 bugware (sic) */
-   status = opts[0] | len | (RingEnd * !((entry + 1) % NUM_TX_DESC));
-   txd->opts1 = cpu_to_le32(status);
+   txd->opts1 = rtl8169_get_txd_opts1(opts[0], len, entry);
 
/* Force all memory writes to complete before notifying device */
wmb();
-- 
2.19.1




[PATCH bpf-next] bpf: Add BPF_MAP_TYPE_QUEUE and BPF_MAP_TYPE_STACK to bpftool-map

2018-11-22 Thread David Calavera
I noticed that these two new BPF map types are not defined in bpftool.
This patch defines those two map types and adds their names to the
bpftool-map documentation.

Signed-off-by: David Calavera 
---
 tools/bpf/bpftool/Documentation/bpftool-map.rst | 3 ++-
 tools/bpf/bpftool/map.c | 2 ++
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/tools/bpf/bpftool/Documentation/bpftool-map.rst 
b/tools/bpf/bpftool/Documentation/bpftool-map.rst
index f55a2daed59b..9e827e342d9e 100644
--- a/tools/bpf/bpftool/Documentation/bpftool-map.rst
+++ b/tools/bpf/bpftool/Documentation/bpftool-map.rst
@@ -42,7 +42,8 @@ MAP COMMANDS
 |  | **percpu_array** | **stack_trace** | **cgroup_array** | **lru_hash**
 |  | **lru_percpu_hash** | **lpm_trie** | **array_of_maps** | **hash_of_maps**
 |  | **devmap** | **sockmap** | **cpumap** | **xskmap** | **sockhash**
-|  | **cgroup_storage** | **reuseport_sockarray** | **percpu_cgroup_storage** }
+|  | **cgroup_storage** | **reuseport_sockarray** | **percpu_cgroup_storage**
+|  | **queue** | **stack** }
 
 DESCRIPTION
 ===
diff --git a/tools/bpf/bpftool/map.c b/tools/bpf/bpftool/map.c
index 7bf38f0e152e..68b656b6edcc 100644
--- a/tools/bpf/bpftool/map.c
+++ b/tools/bpf/bpftool/map.c
@@ -74,6 +74,8 @@ static const char * const map_type_name[] = {
[BPF_MAP_TYPE_CGROUP_STORAGE]   = "cgroup_storage",
[BPF_MAP_TYPE_REUSEPORT_SOCKARRAY] = "reuseport_sockarray",
[BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE]= "percpu_cgroup_storage",
+   [BPF_MAP_TYPE_QUEUE] = "queue",
+   [BPF_MAP_TYPE_STACK] = "stack",
 };
 
 static bool map_is_per_cpu(__u32 type)
-- 
2.17.1



[PATCH net-next 0/5] r8169: some functional improvements

2018-11-22 Thread Heiner Kallweit
This series includes a few functional improvements.

Heiner Kallweit (5):
  r8169: Remove ancient GCC bug workaround in a second place
  r8169: remove default chip versions
  r8169: simplify detecting chip versions with same XID
  r8169: use napi_consume_skb where possible
  r8169: replace macro TX_FRAGS_READY_FOR with a function

 drivers/net/ethernet/realtek/r8169.c | 90 +---
 1 file changed, 43 insertions(+), 47 deletions(-)

-- 
2.19.1



Re: [PATCH bpf] bpf: fix integer overflow in queue_stack_map

2018-11-22 Thread Daniel Borkmann
On 11/22/2018 07:49 PM, Alexei Starovoitov wrote:
> fix the following issues:
> - allow queue_stack_map for root only
> - fix u32 max_entries overflow
> - disallow value_size == 0
> 
> Reported-by: Wei Wu 
> Fixes: f1a2e44a3aec ("bpf: add queue and stack maps")
> Signed-off-by: Alexei Starovoitov 

Applied, thanks everyone!


Re: DSA support for Marvell 88e6065 switch

2018-11-22 Thread Pavel Machek
Hi!

> > > > If I wanted it to work, what do I need to do? AFAICT phy autoprobing
> > > > should just attach it as soon as it is compiled in?
> > > 
> > > Nope. It is a switch, not a PHY. Switches are never auto-probed
> > > because they are not guaranteed to have ID registers.
> > > 
> > > You need to use the legacy device tree binding. Look in
> > > Documentation/devicetree/bindings/net/dsa/dsa.txt, section Deprecated
> > > Binding. You can get more examples if you checkout old kernels. Or
> > > kirkwood-rd88f6281.dtsi, the dsa { } node which is disabled.
> > 
> > Thanks; I ported code from mv88e66xx in the meantime, and switch
> > appears to be detected.
> > 
> > But I'm running into problems with tagging code, and I guess I'd like
> > some help understanding.
> > 
> > tag_trailer: allocates new skb, then copies data around.
> > 
> > tag_qca: does dev->stats.tx_packets++, and reuses existing skb.
> > 
> > tag_brcm: reuses existing skb.

Any idea why tag trailer allocates new skb, and what is going on with
dev->stats.tx_packets++?

> > Is qca wrong in adjusting the statistics? Why does trailer allocate
> > new skb?
> > 
> > 6065 seems to use 2-byte header between "SFD" and "Destination
> > address" in the ethernet frame. That's ... strange place to put
> > header, as addresses are now shifted. I need to put ethernet in
> > promisc mode (by running tcpdump) to get data moving... and cannot
> > figure out what to do in tag_...
> 
> Does this switch chip not also support trailer mode?
> 
> There's basically four tagging modes for Marvell switch chips: header
> mode (the one you described), trailer mode (tag_trailer.c), DSA and
> ethertype DSA.  The switch chips I worked on that didn't support
> (ethertype) DSA tagging did support both header and trailer modes,
> and I chose to run them in trailer mode for the reasons you describe
> above, but if your chip doesn't support trailer mode, then yes,
> you'll have to add support for header mode and put the underlying
> interface into promiscuous mode and such.

It seems that 6060 supports both header (probably, parts of docs are
redacted) and trailer mode... but I'm working with 6065. That does not
support trailer mode... or at least word "trailer" does not appear
anywhere in the documentation.

What chip were you working with? I may want to take a look on their
wording.

6065 indeed has some kind of "egress tagging mode" (with four
options), but I have trouble understanding what it really does.

Thanks,
Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html




Re: [PATCH net] net: thunderx: set xdp_prog to NULL if bpf_prog_add fails

2018-11-22 Thread David Miller
From: Lorenzo Bianconi 
Date: Wed, 21 Nov 2018 16:32:10 +0100

> Set xdp_prog pointer to NULL if bpf_prog_add fails since that routine
> reports the error code instead of NULL in case of failure and xdp_prog
> pointer value is used in the driver to verify if XDP is currently
> enabled.
> Moreover report the error code to userspace if nicvf_xdp_setup fails
> 
> Fixes: 05c773f52b96 ("net: thunderx: Add basic XDP support")
> Signed-off-by: Lorenzo Bianconi 

Applied and queued up for -stable.


[PATCH v2 bpf-next] bpf: add skb->tstamp r/w access from tc clsact and cg skb progs

2018-11-22 Thread Vlad Dumitrescu
This could be used to rate limit egress traffic in concert with a qdisc
which supports Earliest Departure Time, such as FQ.

Write access from cg skb progs only with CAP_SYS_ADMIN, since the value
will be used by downstream qdiscs. It might make sense to relax this.

Changes v1 -> v2:
  - allow access from cg skb, write only with CAP_SYS_ADMIN

Signed-off-by: Vlad Dumitrescu 
---
 include/uapi/linux/bpf.h|  1 +
 net/core/filter.c   | 29 +
 tools/include/uapi/linux/bpf.h  |  1 +
 tools/testing/selftests/bpf/test_verifier.c | 29 +
 4 files changed, 60 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index c1554aa074659..23e2031a43d43 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2468,6 +2468,7 @@ struct __sk_buff {
 
__u32 data_meta;
struct bpf_flow_keys *flow_keys;
+   __u64 tstamp;
 };
 
 struct bpf_tunnel_key {
diff --git a/net/core/filter.c b/net/core/filter.c
index f6ca38a7d4332..65dc13aeca7c4 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5573,6 +5573,10 @@ static bool bpf_skb_is_valid_access(int off, int size, 
enum bpf_access_type type
if (size != sizeof(struct bpf_flow_keys *))
return false;
break;
+   case bpf_ctx_range(struct __sk_buff, tstamp):
+   if (size != sizeof(__u64))
+   return false;
+   break;
default:
/* Only narrow read access allowed for now. */
if (type == BPF_WRITE) {
@@ -5600,6 +5604,7 @@ static bool sk_filter_is_valid_access(int off, int size,
case bpf_ctx_range(struct __sk_buff, data_end):
case bpf_ctx_range(struct __sk_buff, flow_keys):
case bpf_ctx_range_till(struct __sk_buff, family, local_port):
+   case bpf_ctx_range(struct __sk_buff, tstamp):
return false;
}
 
@@ -5638,6 +5643,10 @@ static bool cg_skb_is_valid_access(int off, int size,
case bpf_ctx_range(struct __sk_buff, priority):
case bpf_ctx_range_till(struct __sk_buff, cb[0], cb[4]):
break;
+   case bpf_ctx_range(struct __sk_buff, tstamp):
+   if (!capable(CAP_SYS_ADMIN))
+   return false;
+   break;
default:
return false;
}
@@ -5665,6 +5674,7 @@ static bool lwt_is_valid_access(int off, int size,
case bpf_ctx_range_till(struct __sk_buff, family, local_port):
case bpf_ctx_range(struct __sk_buff, data_meta):
case bpf_ctx_range(struct __sk_buff, flow_keys):
+   case bpf_ctx_range(struct __sk_buff, tstamp):
return false;
}
 
@@ -5874,6 +5884,7 @@ static bool tc_cls_act_is_valid_access(int off, int size,
case bpf_ctx_range(struct __sk_buff, priority):
case bpf_ctx_range(struct __sk_buff, tc_classid):
case bpf_ctx_range_till(struct __sk_buff, cb[0], cb[4]):
+   case bpf_ctx_range(struct __sk_buff, tstamp):
break;
default:
return false;
@@ -6093,6 +6104,7 @@ static bool sk_skb_is_valid_access(int off, int size,
case bpf_ctx_range(struct __sk_buff, tc_classid):
case bpf_ctx_range(struct __sk_buff, data_meta):
case bpf_ctx_range(struct __sk_buff, flow_keys):
+   case bpf_ctx_range(struct __sk_buff, tstamp):
return false;
}
 
@@ -6179,6 +6191,7 @@ static bool flow_dissector_is_valid_access(int off, int 
size,
case bpf_ctx_range(struct __sk_buff, tc_classid):
case bpf_ctx_range(struct __sk_buff, data_meta):
case bpf_ctx_range_till(struct __sk_buff, family, local_port):
+   case bpf_ctx_range(struct __sk_buff, tstamp):
return false;
}
 
@@ -6488,6 +6501,22 @@ static u32 bpf_convert_ctx_access(enum bpf_access_type 
type,
*insn++ = BPF_LDX_MEM(BPF_SIZEOF(void *), si->dst_reg,
  si->src_reg, off);
break;
+
+   case offsetof(struct __sk_buff, tstamp):
+   BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, tstamp) != 8);
+
+   if (type == BPF_WRITE)
+   *insn++ = BPF_STX_MEM(BPF_DW,
+ si->dst_reg, si->src_reg,
+ bpf_target_off(struct sk_buff,
+tstamp, 8,
+target_size));
+   else
+   *insn++ = BPF_LDX_MEM(BPF_DW,
+ si->dst_reg, si->src_reg,
+ bpf_target_off(struct 

Re: [PATCH net-next] {net,IB}/mlx4: Initialize CQ buffers in the driver when possible

2018-11-22 Thread David Miller
From: Tariq Toukan 
Date: Wed, 21 Nov 2018 17:12:05 +0200

> From: Daniel Jurgens 
> 
> Perform CQ initialization in the driver when the capability is supported
> by the FW.  When passing the CQ to HW indicate that the CQ buffer has
> been pre-initialized.
> 
> Doing so decreases CQ creation time.  Testing on P8 showed a single 2048
> entry CQ creation time was reduced from ~395us to ~170us, which is
> 2.3x faster.
> 
> Signed-off-by: Daniel Jurgens 
> Signed-off-by: Jack Morgenstein 
> Signed-off-by: Tariq Toukan 

Applied.


Re: [PATCH net] net/dim: Update DIM start sample after each DIM iteration

2018-11-22 Thread David Miller
From: Tal Gilboa 
Date: Wed, 21 Nov 2018 16:28:23 +0200

> On every iteration of net_dim, the algorithm may choose to
> check for the system state by comparing current data sample
> with previous data sample. After each of these comparison,
> regardless of the action taken, the sample used as baseline
> is needed to be updated.
> 
> This patch fixes a bug that causes DIM to take wrong decisions,
> due to never updating the baseline sample for comparison between
> iterations. This way, DIM always compares current sample with
> zeros.
> 
> Although this is a functional fix, it also improves and stabilizes
> performance as the algorithm works properly now.
> 
> Performance:
> Tested single UDP TX stream with pktgen:
> samples/pktgen/pktgen_sample03_burst_single_flow.sh -i p4p2 -d 1.1.1.1
> -m 24:8a:07:88:26:8b -f 3 -b 128
> 
> ConnectX-5 100GbE packet rate improved from 15-19Mpps to 19-20Mpps.
> Also, toggling between profiles is less frequent with the fix.
> 
> Fixes: 8115b750dbcb ("net/dim: use struct net_dim_sample as arg to net_dim")
> Signed-off-by: Tal Gilboa 
> Reviewed-by: Tariq Toukan 

Applied and queued up for -stable.


Re: [PATCH net-next] selftests: explicitly require kernel features needed by udpgro tests

2018-11-22 Thread David Miller
From: Paolo Abeni 
Date: Wed, 21 Nov 2018 14:31:15 +0100

> commit 3327a9c46352f1 ("selftests: add functionals test for UDP GRO")
> makes use of IPv6 NAT, but such a feature is not currently implied by
> selftests. Since the 'ip[6]tables' commands may actually create nft rules,
> depending on the specific user-space version, let's pull both NF and
> NFT nat modules plus the needed deps.
> 
> Reported-by: Naresh Kamboju 
> Fixes: 3327a9c46352f1 ("selftests: add functionals test for UDP GRO")
> Signed-off-by: Paolo Abeni 

Applied.


[PATCH bpf] bpf: fix integer overflow in queue_stack_map

2018-11-22 Thread Alexei Starovoitov
fix the following issues:
- allow queue_stack_map for root only
- fix u32 max_entries overflow
- disallow value_size == 0

Reported-by: Wei Wu 
Fixes: f1a2e44a3aec ("bpf: add queue and stack maps")
Signed-off-by: Alexei Starovoitov 
---
 kernel/bpf/queue_stack_maps.c | 16 
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/kernel/bpf/queue_stack_maps.c b/kernel/bpf/queue_stack_maps.c
index 8bbd72d3a121..b384ea9f3254 100644
--- a/kernel/bpf/queue_stack_maps.c
+++ b/kernel/bpf/queue_stack_maps.c
@@ -7,6 +7,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "percpu_freelist.h"
 
 #define QUEUE_STACK_CREATE_FLAG_MASK \
@@ -45,8 +46,12 @@ static bool queue_stack_map_is_full(struct bpf_queue_stack 
*qs)
 /* Called from syscall */
 static int queue_stack_map_alloc_check(union bpf_attr *attr)
 {
+   if (!capable(CAP_SYS_ADMIN))
+   return -EPERM;
+
/* check sanity of attributes */
if (attr->max_entries == 0 || attr->key_size != 0 ||
+   attr->value_size == 0 ||
attr->map_flags & ~QUEUE_STACK_CREATE_FLAG_MASK)
return -EINVAL;
 
@@ -63,15 +68,10 @@ static struct bpf_map *queue_stack_map_alloc(union bpf_attr 
*attr)
 {
int ret, numa_node = bpf_map_attr_numa_node(attr);
struct bpf_queue_stack *qs;
-   u32 size, value_size;
-   u64 queue_size, cost;
-
-   size = attr->max_entries + 1;
-   value_size = attr->value_size;
-
-   queue_size = sizeof(*qs) + (u64) value_size * size;
+   u64 size, queue_size, cost;
 
-   cost = queue_size;
+   size = (u64) attr->max_entries + 1;
+   cost = queue_size = sizeof(*qs) + size * attr->value_size;
if (cost >= U32_MAX - PAGE_SIZE)
return ERR_PTR(-E2BIG);
 
-- 
2.17.1



Re: [PATCH v3 net-next 04/12] net: ethernet: Use phy_set_max_speed() to limit advertised speed

2018-11-22 Thread Andrew Lunn
On Thu, Nov 22, 2018 at 12:40:25PM +0200, Anssi Hannula wrote:
> Hi,
> 
> On 12.9.2018 2:53, Andrew Lunn wrote:
> > Many Ethernet MAC drivers want to limit the PHY to only advertise a
> > maximum speed of 100Mbs or 1Gbps. Rather than using a mask, make use
> > of the helper function phy_set_max_speed().
> 
> But what if the PHY does not support 1Gbps in the first place?

Yes, you are correct. __set_phy_supported() needs modifying to take
into account what the PHY can do.

Thanks for pointing this out. I will take a look.

   Andrew


Re: [PATCH net-next 4/4] octeontx2-af: Bringup CGX LMAC links by default

2018-11-22 Thread Andrew Lunn
On Thu, Nov 22, 2018 at 05:18:37PM +0530, Linu Cherian wrote:
> From: Linu Cherian 
> 
> - Added new CGX firmware interface API for sending link up/down
>   commands
> 
> - Do link up for cgx lmac ports by default at the time of CGX
>   driver probe.

Hi Linu

This is a complex driver which I don't understand...

By link up, do you mean the equivalent of 'ip link set up dev ethX'?

   Andrew


Re: [PATCH net-next 2/3] tcp: implement coalescing on backlog queue

2018-11-22 Thread Eric Dumazet
On Thu, Nov 22, 2018 at 10:16 AM Eric Dumazet  wrote:

> Yes, I was considering properly filtering SACK as a refinement later [1]
> but you raise a valid point for alien stacks that are not yet using SACK :/
>
> [1] This version of the patch will not aggregate sacks since the
> memcmp() on tcp options would fail.
>
> Neal can you double check if cake_ack_filter() does not have the issue
> you just mentioned ?

Note that aggregated pure acks will have a gso_segs set to the number
of aggregated acks,
we might simply use this value later in the stack, instead of forcing
having X pure acks in the backlog
and increase memory needs and cpu costs.

Then I guess I need this fix :

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 
36c9d715bf2aa7eb7bf58b045bfeb85a2ec1a696..736f7f24cdb4fe61769faaa1644c8bff01c746c4
100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1669,7 +1669,8 @@ bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb)
__skb_pull(skb, hdrlen);
if (skb_try_coalesce(tail, skb, &fragstolen, &delta)) {
TCP_SKB_CB(tail)->end_seq = TCP_SKB_CB(skb)->end_seq;
-   TCP_SKB_CB(tail)->ack_seq = TCP_SKB_CB(skb)->ack_seq;
+   if (after(TCP_SKB_CB(skb)->ack_seq, TCP_SKB_CB(tail)->ack_seq))
+   TCP_SKB_CB(tail)->ack_seq = TCP_SKB_CB(skb)->ack_seq;
TCP_SKB_CB(tail)->tcp_flags |= TCP_SKB_CB(skb)->tcp_flags;

if (TCP_SKB_CB(skb)->has_rxtstamp) {


Re: [PATCH net-next 2/3] tcp: implement coalescing on backlog queue

2018-11-22 Thread Eric Dumazet
On Thu, Nov 22, 2018 at 10:01 AM Neal Cardwell  wrote:
>
> On Wed, Nov 21, 2018 at 12:52 PM Eric Dumazet  wrote:
> >
> > In case GRO is not as efficient as it should be or disabled,
> > we might have a user thread trapped in __release_sock() while
> > softirq handler flood packets up to the point we have to drop.
> >
> > This patch balances work done from user thread and softirq,
> > to give more chances to __release_sock() to complete its work.
> >
> > This also helps if we receive many ACK packets, since GRO
> > does not aggregate them.
>
> Would this coalesce duplicate incoming ACK packets? Is there a risk
> that this would eliminate incoming dupacks needed for fast recovery in
> non-SACK connections? Perhaps pure ACKs should only be coalesced if
> the ACK field is different?

Yes, I was considering properly filtering SACK as a refinement later [1]
but you raise a valid point for alien stacks that are not yet using SACK :/

[1] This version of the patch will not aggregate sacks since the
memcmp() on tcp options would fail.

Neal can you double check if cake_ack_filter() does not have the issue
you just mentioned ?


Re: [PATCH v1 net] lan743x: Enable driver to work with LAN7431

2018-11-22 Thread Andrew Lunn
On Wed, Nov 21, 2018 at 02:22:45PM -0500, Bryan Whitehead wrote:
> This driver was designed to work with both LAN7430 and LAN7431.
> The only difference between the two is the LAN7431 has support
> for external phy.
> 
> This change adds LAN7431 to the list of recognized devices
> supported by this driver.
> 
> fixes: driver won't load for LAN7431

Hi Bryan

There is a well defined format for Fixes:.

Fixes: 23f0703c125b ("lan743x: Add main source files for new lan743x driver")

   Andrew


Re: [RFC v4 4/5] netdev: add netdev_is_upper_master

2018-11-22 Thread Alexis Bauvin
On 22 Nov 2018 at 18:14, David Ahern  wrote:
> On 11/21/18 6:07 PM, Alexis Bauvin wrote:
>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index 93243479085f..12459036d0da 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -7225,6 +7225,23 @@ void netdev_lower_state_changed(struct net_device 
>> *lower_dev,
>> }
>> EXPORT_SYMBOL(netdev_lower_state_changed);
>> 
>> +/**
>> + * netdev_is_upper_master - Test if a device is a master, direct or 
>> indirect,
>> + *  of another one.
>> + * @dev: device to start looking from
>> + * @master: device to test if master of dev
>> + */
>> +bool netdev_is_upper_master(struct net_device *dev, struct net_device 
>> *master)
>> +{
>> +if (!dev)
>> +return false;
>> +
>> +if (dev->ifindex == master->ifindex)
> 
> dev == master should work as well without the dereference.

Ack, will add to next version.

>> +return true;
>> +return netdev_is_upper_master(netdev_master_upper_dev_get(dev), master);
>> +}
>> +EXPORT_SYMBOL(netdev_is_upper_master);
>> +
>> static void dev_change_rx_flags(struct net_device *dev, int flags)
>> {
>>  const struct net_device_ops *ops = dev->netdev_ops;
>> 


Re: [PATCH net-next 2/3] tcp: implement coalescing on backlog queue

2018-11-22 Thread Neal Cardwell
On Wed, Nov 21, 2018 at 12:52 PM Eric Dumazet  wrote:
>
> In case GRO is not as efficient as it should be or disabled,
> we might have a user thread trapped in __release_sock() while
> softirq handler flood packets up to the point we have to drop.
>
> This patch balances work done from user thread and softirq,
> to give more chances to __release_sock() to complete its work.
>
> This also helps if we receive many ACK packets, since GRO
> does not aggregate them.

Would this coalesce duplicate incoming ACK packets? Is there a risk
that this would eliminate incoming dupacks needed for fast recovery in
non-SACK connections? Perhaps pure ACKs should only be coalesced if
the ACK field is different?

neal


Re: [RFC v4 3/5] vxlan: add support for underlay in non-default VRF

2018-11-22 Thread David Ahern
On 11/21/18 6:07 PM, Alexis Bauvin wrote:
> Creating a VXLAN device with its underlay in a non-default VRF makes
> egress route lookups fail or resolve incorrectly, since they resolve in
> the default VRF, and makes ingress fail because the socket listens in
> the default VRF.
> 
> This patch binds the underlying UDP tunnel socket to the l3mdev of the
> lower device of the VXLAN device. This will listen in the proper VRF and
> output traffic from said l3mdev, matching l3mdev routing rules and
> looking up the correct routing table.
> 
> When the VXLAN device does not have a lower device, or the lower device
> is in the default VRF, the socket will not be bound to any interface,
> keeping the previous behaviour.
> 
> The underlay l3mdev is deduced from the VXLAN lower device
> (IFLA_VXLAN_LINK).
> 
> +--+ +-+
> |  | | |
> | vrf-blue | | vrf-red |
> |  | | |
> ++-+ +++
>  ||
>  ||
> ++-+ +++
> |  | | |
> | br-blue  | | br-red  |
> |  | | |
> ++-+ +---+-+---+
>  |   | |
>  | +-+ +-+
>  | | |
> ++-++--++   +++
> |  |  lower device  |   |   | |
> |   eth0   | <- - - - - - - | vxlan-red |   | tap-red | (... more taps)
> |  ||   |   | |
> +--++---+   +-+
> 
> Signed-off-by: Alexis Bauvin 
> Reviewed-by: Amine Kherbouche 
> Tested-by: Amine Kherbouche 
> ---
>  drivers/net/vxlan.c   | 32 +--
>  .../selftests/net/test_vxlan_under_vrf.sh | 90 +++
>  2 files changed, 114 insertions(+), 8 deletions(-)
>  create mode 100755 tools/testing/selftests/net/test_vxlan_under_vrf.sh
> 

Reviewed-by: David Ahern 

Thanks for adding the test case; I'll try it out next week (after the
holidays).


Re: [RFC v4 4/5] netdev: add netdev_is_upper_master

2018-11-22 Thread David Ahern
On 11/21/18 6:07 PM, Alexis Bauvin wrote:
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 93243479085f..12459036d0da 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -7225,6 +7225,23 @@ void netdev_lower_state_changed(struct net_device 
> *lower_dev,
>  }
>  EXPORT_SYMBOL(netdev_lower_state_changed);
>  
> +/**
> + * netdev_is_upper_master - Test if a device is a master, direct or indirect,
> + *  of another one.
> + * @dev: device to start looking from
> + * @master: device to test if master of dev
> + */
> +bool netdev_is_upper_master(struct net_device *dev, struct net_device 
> *master)
> +{
> + if (!dev)
> + return false;
> +
> + if (dev->ifindex == master->ifindex)

dev == master should work as well without the dereference.

> + return true;
> + return netdev_is_upper_master(netdev_master_upper_dev_get(dev), master);
> +}
> +EXPORT_SYMBOL(netdev_is_upper_master);
> +
>  static void dev_change_rx_flags(struct net_device *dev, int flags)
>  {
>   const struct net_device_ops *ops = dev->netdev_ops;
> 



Re: [RFC v4 2/5] l3mdev: add function to retreive upper master

2018-11-22 Thread David Ahern
On 11/21/18 6:07 PM, Alexis Bauvin wrote:
> Existing functions to retrieve the l3mdev of a device did not walk the
> master chain to find the upper master. This patch adds a function to
> find the l3mdev, even indirectly through e.g. a bridge:
> 
>
...

> 
> This will properly resolve the l3mdev of eth0 to vrf-blue.
> 
> Signed-off-by: Alexis Bauvin 
> Reviewed-by: Amine Kherbouche 
> Tested-by: Amine Kherbouche 
> ---
>  include/net/l3mdev.h | 22 ++
>  net/l3mdev/l3mdev.c  | 18 ++
>  2 files changed, 40 insertions(+)
> 

Reviewed-by: David Ahern 




Re: [RFC v4 1/5] udp_tunnel: add config option to bind to a device

2018-11-22 Thread David Ahern
On 11/21/18 6:07 PM, Alexis Bauvin wrote:
> UDP tunnel sockets are always opened unbound to a specific device. This
> patch allow the socket to be bound on a custom device, which
> incidentally makes UDP tunnels VRF-aware if binding to an l3mdev.
> 
> Signed-off-by: Alexis Bauvin 
> Reviewed-by: Amine Kherbouche 
> Tested-by: Amine Kherbouche 
> ---
>  include/net/udp_tunnel.h  |  1 +
>  net/ipv4/udp_tunnel.c | 10 ++
>  net/ipv6/ip6_udp_tunnel.c |  9 +
>  3 files changed, 20 insertions(+)

Reviewed-by: David Ahern 





patchwork bug?

2018-11-22 Thread Nicolas Dichtel
Not sure if it's the right place to post that.

When I try to list patches with filters, something like this:
http://patchwork.ozlabs.org/project/netdev/list/?series==2036=*==both=34

I can only see page 1. When I click on '2', page 1 is still displayed and
the page numbering disappears.


Regards,
Nicolas


Re: consistency for statistics with XDP mode

2018-11-22 Thread Toke Høiland-Jørgensen
David Ahern  writes:

> On 11/22/18 1:26 AM, Toke Høiland-Jørgensen wrote:
>> Saeed Mahameed  writes:
>> 
>>>> I'd say it sounds reasonable to include XDP in the normal traffic
>>>> counters, but having the detailed XDP-specific counters is quite
>>>> useful
>>>> as well... So can't we do both (for all drivers)?
>>>>
>>>
>>> What are you thinking ? 
>>> reporting XDP_DROP in interface dropped counter ?
>>> and XDP_TX/REDIRECT in the TX counter ?
>>> XDP_ABORTED in the  err/drop counter ?
>>>
>>> how about having a special XDP command in the .ndo_bpf that would query
>>> the standardized XDP stats ?
>> 
>> Don't have any strong opinions on the mechanism; just pointing out that
>> the XDP-specific stats are useful to have separately as well :)
>>
>
> I would like to see basic packets, bytes, and dropped counters tracked
> for Rx and Tx via the standard netdev counters for all devices. This is
> for ease in accounting as well as speed and simplicity for bumping
> counters for virtual devices from bpf helpers.
>
> From there, the XDP ones can be in the driver private stats as they are
> currently but with some consistency across drivers for redirects, drops,
> any thing else.
>
> So not a radical departure from where we are today, just getting the
> agreement for consistency and driver owners to make the changes.

Sounds good to me :)

-Toke


Re: [PATCH net-next,v3 12/12] qede: use ethtool_rx_flow_rule() to remove duplicated parser code

2018-11-22 Thread Marcelo Ricardo Leitner
On Wed, Nov 21, 2018 at 03:51:32AM +0100, Pablo Neira Ayuso wrote:
...
>  static int
>  qede_parse_flower_attr(struct qede_dev *edev, __be16 proto,
> -struct tc_cls_flower_offload *f,
> -struct qede_arfs_tuple *tuple)
> +struct flow_rule *rule, struct qede_arfs_tuple *tuple)

What about s/qede_parse_flower_attr/qede_parse_flow_attr/ or so? As it
is not about flower anymore.

It also helps here:

> -static int qede_flow_spec_to_tuple(struct qede_dev *edev,
> -struct qede_arfs_tuple *t,
> -struct ethtool_rx_flow_spec *fs)
> +static int qede_flow_spec_to_rule(struct qede_dev *edev,
> +   struct qede_arfs_tuple *t,
> +   struct ethtool_rx_flow_spec *fs)
>  {
...
> +
> + if (qede_parse_flower_attr(edev, proto, flow->rule, t)) {
> + err = -EINVAL;
> + goto err_out;
> + }
> +
> + /* Make sure location is valid and filter isn't already set */
> + err = qede_flow_spec_validate(edev, &flow->rule->action, t,
> +   fs->location);
...



Re: consistency for statistics with XDP mode

2018-11-22 Thread David Ahern
On 11/22/18 1:26 AM, Toke Høiland-Jørgensen wrote:
> Saeed Mahameed  writes:
> 
>>> I'd say it sounds reasonable to include XDP in the normal traffic
>>> counters, but having the detailed XDP-specific counters is quite
>>> useful
>>> as well... So can't we do both (for all drivers)?

>>
>> What are you thinking ? 
>> reporting XDP_DROP in interface dropped counter ?
>> and XDP_TX/REDIRECT in the TX counter ?
>> XDP_ABORTED in the  err/drop counter ?
>>
>> how about having a special XDP command in the .ndo_bpf that would query
>> the standardized XDP stats ?
> 
> Don't have any strong opinions on the mechanism; just pointing out that
> the XDP-specific stats are useful to have separately as well :)
>

I would like to see basic packets, bytes, and dropped counters tracked
for Rx and Tx via the standard netdev counters for all devices. This is
for ease in accounting as well as speed and simplicity for bumping
counters for virtual devices from bpf helpers.

From there, the XDP ones can be in the driver private stats as they are
currently but with some consistency across drivers for redirects, drops,
any thing else.

So not a radical departure from where we are today, just getting the
agreement for consistency and driver owners to make the changes.


Re: consistency for statistics with XDP mode

2018-11-22 Thread David Ahern
On 11/21/18 5:53 PM, Toshiaki Makita wrote:
>> We really need consistency in the counters and at a minimum, users
>> should be able to track packet and byte counters for both Rx and Tx
>> including XDP.
>>
>> It seems to me the Rx and Tx packet, byte and dropped counters returned
>> for the standard device stats (/proc/net/dev, ip -s li show, ...) should
>> include all packets managed by the driver regardless of whether they are
>> forwarded / dropped in XDP or go up the Linux stack. This also aligns
> 
> Agreed. When I introduced virtio_net XDP counters, I just forgot to
> update tx packets/bytes counters on ndo_xdp_xmit. I probably thought it
> was handled by free_old_xmit_skbs.

Do you have some time to look at adding the Tx counters to virtio_net?



Re: [PATCH net-next v2 5/5] netns: enable to dump full nsid translation table

2018-11-22 Thread Nicolas Dichtel
On 22/11/2018 at 17:40, David Ahern wrote:
> On 11/22/18 8:50 AM, Nicolas Dichtel wrote:
>> diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
>> index dd25fb22ad45..25030e0317a2 100644
>> --- a/net/core/net_namespace.c
>> +++ b/net/core/net_namespace.c
>> @@ -745,6 +745,8 @@ struct net_fill_args {
>>  int flags;
>>  int cmd;
>>  int nsid;
>> +bool add_ref;
>> +int ref_nsid;
>>  };
>>  
>>  static int rtnl_net_fill(struct sk_buff *skb, struct net_fill_args *args)
>> @@ -763,6 +765,10 @@ static int rtnl_net_fill(struct sk_buff *skb, struct 
>> net_fill_args *args)
>>  if (nla_put_s32(skb, NETNSA_NSID, args->nsid))
>>  goto nla_put_failure;
>>  
>> +if (args->add_ref &&
>> +nla_put_s32(skb, NETNSA_CURRENT_NSID, args->ref_nsid))
>> +goto nla_put_failure;
>> +
> 
> you need to add NETNSA_CURRENT_NSID to rtnl_net_get_size.
> 
Good catch.
I thought of this and then forgot in the end :/


Re: [PATCH net-next v2 5/5] netns: enable to dump full nsid translation table

2018-11-22 Thread David Ahern
On 11/22/18 8:50 AM, Nicolas Dichtel wrote:
> diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
> index dd25fb22ad45..25030e0317a2 100644
> --- a/net/core/net_namespace.c
> +++ b/net/core/net_namespace.c
> @@ -745,6 +745,8 @@ struct net_fill_args {
>   int flags;
>   int cmd;
>   int nsid;
> + bool add_ref;
> + int ref_nsid;
>  };
>  
>  static int rtnl_net_fill(struct sk_buff *skb, struct net_fill_args *args)
> @@ -763,6 +765,10 @@ static int rtnl_net_fill(struct sk_buff *skb, struct 
> net_fill_args *args)
>   if (nla_put_s32(skb, NETNSA_NSID, args->nsid))
>   goto nla_put_failure;
>  
> + if (args->add_ref &&
> + nla_put_s32(skb, NETNSA_CURRENT_NSID, args->ref_nsid))
> + goto nla_put_failure;
> +

you need to add NETNSA_CURRENT_NSID to rtnl_net_get_size.


[PATCH bpf] bpf: Fix integer overflow in queue_stack_map_alloc.

2018-11-22 Thread ww9210
Integer overflow in queue_stack_map_alloc when calculating the size may
lead to a heap overflow of arbitrary length.
The patch fixes it by checking whether attr->max_entries + 1 <
attr->max_entries and bailing out if that is the case.
The vulnerability was discovered with the assistance of syzkaller.

Reported-by: Wei Wu 
To: Alexei Starovoitov 
Cc: Daniel Borkmann 
Cc: netdev 
Cc: Eric Dumazet 
Cc: Greg KH 
Signed-off-by: Wei Wu 
---
 kernel/bpf/queue_stack_maps.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/bpf/queue_stack_maps.c b/kernel/bpf/queue_stack_maps.c
index 8bbd72d3a121..c35a8a4721c8 100644
--- a/kernel/bpf/queue_stack_maps.c
+++ b/kernel/bpf/queue_stack_maps.c
@@ -67,6 +67,8 @@ static struct bpf_map *queue_stack_map_alloc(union bpf_attr 
*attr)
u64 queue_size, cost;
 
size = attr->max_entries + 1;
+   if (size < attr->max_entries)
+   return ERR_PTR(-EINVAL);
value_size = attr->value_size;
 
queue_size = sizeof(*qs) + (u64) value_size * size;
-- 
2.17.1



Re: [PATCH net-next 2/3] tcp: implement coalescing on backlog queue

2018-11-22 Thread Yuchung Cheng
On Wed, Nov 21, 2018 at 2:40 PM, Eric Dumazet  wrote:
>
>
> On 11/21/2018 02:31 PM, Yuchung Cheng wrote:
>> On Wed, Nov 21, 2018 at 9:52 AM, Eric Dumazet  wrote:
>
>>> +
>> Really nice! would it make sense to re-use (some of) the similar
>> tcp_try_coalesce()?
>>
>
> Maybe, but it is a bit complex, since skbs in receive queues (regular or out 
> of order)
> are accounted differently (they have skb->destructor set)
>
> Also they had the TCP header pulled already, while the backlog coalescing 
> also has
> to make sure TCP options match.
>
> Not sure if we want to add extra parameters and conditional checks...
Makes sense.

Acked-by: Yuchung Cheng 

>
>


Re: [PATCH net-next,v3 04/12] cls_api: add translator to flow_action representation

2018-11-22 Thread Marcelo Ricardo Leitner
On Wed, Nov 21, 2018 at 03:51:24AM +0100, Pablo Neira Ayuso wrote:
...
> +int tc_setup_flow_action(struct flow_action *flow_action,
> +  const struct tcf_exts *exts)
> +{
> + const struct tc_action *act;
> + int i, j, k;
> +
> + if (!exts)
> + return 0;
> +
> + j = 0;
> + tcf_exts_for_each_action(i, act, exts) {
> + struct flow_action_entry *key;
   ^  ^^^

> +
> + key = &flow_action->entries[j];
^^^ ^^^

Considering previous changes, what about a s/key/entry/ in the
variable name here too?

> + if (is_tcf_gact_ok(act)) {
> + key->id = FLOW_ACTION_ACCEPT;
> + } else if (is_tcf_gact_shot(act)) {
> + key->id = FLOW_ACTION_DROP;
> + } else if (is_tcf_gact_trap(act)) {
> + key->id = FLOW_ACTION_TRAP;
> + } else if (is_tcf_gact_goto_chain(act)) {
> + key->id = FLOW_ACTION_GOTO;
> + key->chain_index = tcf_gact_goto_chain_index(act);
> + } else if (is_tcf_mirred_egress_redirect(act)) {
> + key->id = FLOW_ACTION_REDIRECT;
> + key->dev = tcf_mirred_dev(act);
> + } else if (is_tcf_mirred_egress_mirror(act)) {
> + key->id = FLOW_ACTION_MIRRED;
> + key->dev = tcf_mirred_dev(act);
> + } else if (is_tcf_vlan(act)) {
> + switch (tcf_vlan_action(act)) {
> + case TCA_VLAN_ACT_PUSH:
> + key->id = FLOW_ACTION_VLAN_PUSH;
> + key->vlan.vid = tcf_vlan_push_vid(act);
> + key->vlan.proto = tcf_vlan_push_proto(act);
> + key->vlan.prio = tcf_vlan_push_prio(act);
> + break;
> + case TCA_VLAN_ACT_POP:
> + key->id = FLOW_ACTION_VLAN_POP;
> + break;
> + case TCA_VLAN_ACT_MODIFY:
> + key->id = FLOW_ACTION_VLAN_MANGLE;
> + key->vlan.vid = tcf_vlan_push_vid(act);
> + key->vlan.proto = tcf_vlan_push_proto(act);
> + key->vlan.prio = tcf_vlan_push_prio(act);
> + break;
> + default:
> + goto err_out;
> + }
> + } else if (is_tcf_tunnel_set(act)) {
> + key->id = FLOW_ACTION_TUNNEL_ENCAP;
> + key->tunnel = tcf_tunnel_info(act);
> + } else if (is_tcf_tunnel_release(act)) {
> + key->id = FLOW_ACTION_TUNNEL_DECAP;
> + key->tunnel = tcf_tunnel_info(act);
> + } else if (is_tcf_pedit(act)) {
> + for (k = 0; k < tcf_pedit_nkeys(act); k++) {
> + switch (tcf_pedit_cmd(act, k)) {
> + case TCA_PEDIT_KEY_EX_CMD_SET:
> + key->id = FLOW_ACTION_MANGLE;
> + break;
> + case TCA_PEDIT_KEY_EX_CMD_ADD:
> + key->id = FLOW_ACTION_ADD;
> + break;
> + default:
> + goto err_out;
> + }
> + key->mangle.htype = tcf_pedit_htype(act, k);
> + key->mangle.mask = tcf_pedit_mask(act, k);
> + key->mangle.val = tcf_pedit_val(act, k);
> + key->mangle.offset = tcf_pedit_offset(act, k);
> + key = &flow_action->entries[++j];
> + }
> + } else if (is_tcf_csum(act)) {
> + key->id = FLOW_ACTION_CSUM;
> + key->csum_flags = tcf_csum_update_flags(act);
> + } else if (is_tcf_skbedit_mark(act)) {
> + key->id = FLOW_ACTION_MARK;
> + key->mark = tcf_skbedit_mark(act);
> + } else {
> + goto err_out;
> + }
> +
> + if (!is_tcf_pedit(act))
> + j++;
> + }
> + return 0;
> +err_out:
> + return -EOPNOTSUPP;
> +}
> +EXPORT_SYMBOL(tc_setup_flow_action);


Re: [PATCH net-next v2 3/5] netns: add support of NETNSA_TARGET_NSID

2018-11-22 Thread David Ahern
On 11/22/18 8:50 AM, Nicolas Dichtel wrote:
> As was done for link and address, add the ability to perform get/dump
> in another netns by specifying a target nsid attribute.
> 
> Signed-off-by: Nicolas Dichtel 
> ---
>  include/uapi/linux/net_namespace.h |  1 +
>  net/core/net_namespace.c   | 86 ++
>  2 files changed, 76 insertions(+), 11 deletions(-)

Reviewed-by: David Ahern 



Re: [PATCH net-next v2 4/5] netns: enable to specify a nsid for a get request

2018-11-22 Thread David Ahern
On 11/22/18 8:50 AM, Nicolas Dichtel wrote:
> Combined with NETNSA_TARGET_NSID, it enables "translating" a nsid from one
> netns to a nsid of another netns.
> This is useful when using NETLINK_F_LISTEN_ALL_NSID because it helps the
> user interpret a nsid received from another netns.
> 
> Signed-off-by: Nicolas Dichtel 
> ---
>  net/core/net_namespace.c | 5 +
>  1 file changed, 5 insertions(+)
> 

Reviewed-by: David Ahern 



Re: [PATCH net-next v2 2/5] netns: introduce 'struct net_fill_args'

2018-11-22 Thread David Ahern
On 11/22/18 8:50 AM, Nicolas Dichtel wrote:
> This is preparatory work. To avoid passing too many arguments to the
> function rtnl_net_fill(), a new structure is defined.
> 
> Signed-off-by: Nicolas Dichtel 
> ---
>  net/core/net_namespace.c | 48 
>  1 file changed, 34 insertions(+), 14 deletions(-)
> 

Reviewed-by: David Ahern 


Re: [PATCH net-next,v3 00/12] add flow_rule infrastructure

2018-11-22 Thread Marcelo Ricardo Leitner
On Wed, Nov 21, 2018 at 03:51:20AM +0100, Pablo Neira Ayuso wrote:
> Hi,
> 
> This patchset is the third iteration [1] [2] [3] to introduce a kernel
> intermediate (IR) to express ACL hardware offloads.

On v2 cover letter you had:

"""
However, cost of this layer is very small, adding 1 million rules via
tc -batch, perf shows:

 0.06%  tc   [kernel.vmlinux][k] tc_setup_flow_action
"""

The above doesn't include time spent on children calls and I'm worried
about the new allocation done by flow_rule_alloc(), as it can impact
rule insertion rate. I'll run some tests here and report back.



Re: [PATCH net-next v2 1/5] netns: remove net arg from rtnl_net_fill()

2018-11-22 Thread David Ahern
On 11/22/18 8:50 AM, Nicolas Dichtel wrote:
> This argument is not used anymore.
> 
> Fixes: cab3c8ec8d57 ("netns: always provide the id to rtnl_net_fill()")
> Signed-off-by: Nicolas Dichtel 
> ---
>  net/core/net_namespace.c | 10 --
>  1 file changed, 4 insertions(+), 6 deletions(-)
> 

Reviewed-by: David Ahern 



Re: [PATCH bpf] bpf: Fix integer overflow in queue_stack_map_alloc.

2018-11-22 Thread Greg KH
On Thu, Nov 22, 2018 at 11:59:02PM +0800, Wei Wu wrote:
> Integer overflow in queue_stack_map_alloc when calculating size may
> lead to heap overflow of arbitrary length.
> The patch fixes it by checking whether attr->max_entries + 1 <
> attr->max_entries and bailing out if that is the case.
> The vulnerability is discovered with the assistance of syzkaller.
> 
> Reported-by: Wei Wu 
> To: Alexei Starovoitov 
> Cc: Daniel Borkmann 
> Cc: netdev 
> Cc: Eric Dumazet 
> Cc: Greg KH 
> Signed-off-by: Wei Wu 
> ---
>  kernel/bpf/queue_stack_maps.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/kernel/bpf/queue_stack_maps.c b/kernel/bpf/queue_stack_maps.c
> index 8bbd72d3a121..c35a8a4721c8 100644
> --- a/kernel/bpf/queue_stack_maps.c
> +++ b/kernel/bpf/queue_stack_maps.c
> @@ -67,6 +67,8 @@ static struct bpf_map *queue_stack_map_alloc(union bpf_attr *attr)
>   u64 queue_size, cost;
> 
>   size = attr->max_entries + 1;
> + if (size < attr->max_entries)
> + return ERR_PTR(-EINVAL);
>   value_size = attr->value_size;

Your tabs got eaten by your email client and they all disappeared,
making the patch impossible to apply :(

Care to try again?

thanks,

greg k-h


Re: [PATCH net-next 2/2] net: bridge: add no_linklocal_learn bool option

2018-11-22 Thread Andrew Lunn
>  int br_boolopt_get(const struct net_bridge *br, enum br_boolopt_id opt)
>  {
> - int optval = 0;
> -
>   switch (opt) {
> + case BR_BOOLOPT_NO_LL_LEARN:
> + return br_opt_get(br, BROPT_NO_LL_LEARN);
>   default:
>   break;
>   }
>  
> - return optval;
> + return 0;
>  }

It seems like 1/2 of that change belongs in the previous patch.

> --- a/net/bridge/br_sysfs_br.c
> +++ b/net/bridge/br_sysfs_br.c
> @@ -328,6 +328,27 @@ static ssize_t flush_store(struct device *d,
>  }
>  static DEVICE_ATTR_WO(flush);
>  
> +static ssize_t no_linklocal_learn_show(struct device *d,
> +struct device_attribute *attr,
> +char *buf)
> +{
> + struct net_bridge *br = to_bridge(d);
> + return sprintf(buf, "%d\n", br_boolopt_get(br, BR_BOOLOPT_NO_LL_LEARN));
> +}
> +
> +static int set_no_linklocal_learn(struct net_bridge *br, unsigned long val)
> +{
> + return br_boolopt_toggle(br, BR_BOOLOPT_NO_LL_LEARN, !!val);
> +}
> +
> +static ssize_t no_linklocal_learn_store(struct device *d,
> + struct device_attribute *attr,
> + const char *buf, size_t len)
> +{
> + return store_bridge_parm(d, buf, len, set_no_linklocal_learn);
> +}
> +static DEVICE_ATTR_RW(no_linklocal_learn);

I thought we were trying to move away from sysfs? Do we need to add
new options here? It seems like forcing people to use iproute2 for
newer options is a good way to get people to convert to iproute2.

Andrew


[PATCH] bpf: fix check of allowed specifiers in bpf_trace_printk

2018-11-22 Thread Martynas Pumputis
A format string consisting of "%p" or "%s" followed by an invalid
specifier (e.g. "%p%\n" or "%s%") could pass the check, which
would make format_decode() (lib/vsprintf.c) warn.

Reported-by: syzbot+1ec5c5ec949c4adaa...@syzkaller.appspotmail.com
Signed-off-by: Martynas Pumputis 
---
 kernel/trace/bpf_trace.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 08fcfe440c63..9ab05736e1a1 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -225,6 +225,8 @@ BPF_CALL_5(bpf_trace_printk, char *, fmt, u32, fmt_size, u64, arg1,
(void *) (long) unsafe_addr,
sizeof(buf));
}
+   if (fmt[i] == '%')
+   i--;
continue;
}
 
-- 
2.19.1
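The class of strings that slipped through can be modeled outside the kernel. The toy validator below is NOT the bpf_trace_printk() code — it is a simplified sketch of the bug pattern: after consuming a %s/%p specifier the loop skipped one extra byte, so a '%' sitting right after it was never validated; the fix steps the index back so the next loop iteration re-checks it.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Toy format check (not the kernel code): accepts %d, %s and %p only.
 * With apply_fix == false it reproduces the bug: "%s%" is accepted. */
static bool fmt_valid(const char *fmt, bool apply_fix)
{
	size_t len = strlen(fmt);

	for (size_t i = 0; i < len; i++) {
		if (fmt[i] != '%')
			continue;
		if (++i == len)
			return false;		/* dangling '%' */
		if (fmt[i] == 'd')
			continue;
		if (fmt[i] == 's' || fmt[i] == 'p') {
			i++;			/* models the extra byte consumed */
			if (apply_fix && i < len && fmt[i] == '%')
				i--;		/* the fix: let the loop re-check it */
			continue;
		}
		return false;			/* unknown specifier */
	}
	return true;
}
```

With the fix applied, `fmt_valid("%s%", true)` rejects the string, matching the two-line `if (fmt[i] == '%') i--;` change in the patch above.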



[PATCH bpf] bpf: Fix integer overflow in queue_stack_map_alloc.

2018-11-22 Thread Wei Wu
Integer overflow in queue_stack_map_alloc when calculating size may
lead to heap overflow of arbitrary length.
The patch fixes it by checking whether attr->max_entries + 1 <
attr->max_entries and bailing out if that is the case.
The vulnerability is discovered with the assistance of syzkaller.

Reported-by: Wei Wu 
To: Alexei Starovoitov 
Cc: Daniel Borkmann 
Cc: netdev 
Cc: Eric Dumazet 
Cc: Greg KH 
Signed-off-by: Wei Wu 
---
 kernel/bpf/queue_stack_maps.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/bpf/queue_stack_maps.c b/kernel/bpf/queue_stack_maps.c
index 8bbd72d3a121..c35a8a4721c8 100644
--- a/kernel/bpf/queue_stack_maps.c
+++ b/kernel/bpf/queue_stack_maps.c
@@ -67,6 +67,8 @@ static struct bpf_map *queue_stack_map_alloc(union bpf_attr *attr)
  u64 queue_size, cost;

  size = attr->max_entries + 1;
+ if (size < attr->max_entries)
+ return ERR_PTR(-EINVAL);
  value_size = attr->value_size;

  queue_size = sizeof(*qs) + (u64) value_size * size;
--
2.17.1


[PATCH net-next v2 0/5] Ease to interpret net-nsid

2018-11-22 Thread Nicolas Dichtel
The goal of this series is to ease the interpretation of nsids received in
netlink messages from other netns (when the user uses
NETLINK_F_LISTEN_ALL_NSID).

After this series, with a patched iproute2:

$ ip netns add foo
$ ip netns add bar
$ touch /var/run/netns/init_net
$ mount --bind /proc/1/ns/net /var/run/netns/init_net
$ ip netns set init_net 11
$ ip netns set foo 12
$ ip netns set bar 13
$ ip netns
init_net (id: 11)
bar (id: 13)
foo (id: 12)
$ ip -n foo netns set init_net 21
$ ip -n foo netns set foo 22
$ ip -n foo netns set bar 23
$ ip -n foo netns
init_net (id: 21)
bar (id: 23)
foo (id: 22)
$ ip -n bar netns set init_net 31
$ ip -n bar netns set foo 32
$ ip -n bar netns set bar 33
$ ip -n bar netns
init_net (id: 31)
bar (id: 33)
foo (id: 32)
$ ip netns list-id target-nsid 12
nsid 21 current-nsid 11 (iproute2 netns name: init_net)
nsid 22 current-nsid 12 (iproute2 netns name: foo)
nsid 23 current-nsid 13 (iproute2 netns name: bar)
$ ip -n bar netns list-id target-nsid 32 nsid 31
nsid 21 current-nsid 31 (iproute2 netns name: init_net)

v1 -> v2:
  - patch 1/5: remove net from struct rtnl_net_dump_cb
  - patch 2/5: new in this version
  - patch 3/5: use a bool to know if rtnl_get_net_ns_capable() was called
  - patch 5/5: use struct net_fill_args

 include/uapi/linux/net_namespace.h |   2 +
 net/core/net_namespace.c   | 157 +++--
 2 files changed, 133 insertions(+), 26 deletions(-)

Comments are welcomed,
Regards,
Nicolas



[PATCH net-next v2 3/5] netns: add support of NETNSA_TARGET_NSID

2018-11-22 Thread Nicolas Dichtel
As was done for link and address, add the ability to perform get/dump
in another netns by specifying a target nsid attribute.

Signed-off-by: Nicolas Dichtel 
---
 include/uapi/linux/net_namespace.h |  1 +
 net/core/net_namespace.c   | 86 ++
 2 files changed, 76 insertions(+), 11 deletions(-)

diff --git a/include/uapi/linux/net_namespace.h b/include/uapi/linux/net_namespace.h
index 0187c74d8889..0ed9dd61d32a 100644
--- a/include/uapi/linux/net_namespace.h
+++ b/include/uapi/linux/net_namespace.h
@@ -16,6 +16,7 @@ enum {
NETNSA_NSID,
NETNSA_PID,
NETNSA_FD,
+   NETNSA_TARGET_NSID,
__NETNSA_MAX,
 };
 
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index f8a5966b086c..885c54197e31 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -669,6 +669,7 @@ static const struct nla_policy rtnl_net_policy[NETNSA_MAX + 1] = {
[NETNSA_NSID]   = { .type = NLA_S32 },
[NETNSA_PID]= { .type = NLA_U32 },
[NETNSA_FD] = { .type = NLA_U32 },
+   [NETNSA_TARGET_NSID]= { .type = NLA_S32 },
 };
 
 static int rtnl_net_newid(struct sk_buff *skb, struct nlmsghdr *nlh,
@@ -780,9 +781,10 @@ static int rtnl_net_getid(struct sk_buff *skb, struct nlmsghdr *nlh,
.seq = nlh->nlmsg_seq,
.cmd = RTM_NEWNSID,
};
+   struct net *peer, *target = net;
+   bool put_target = false;
struct nlattr *nla;
struct sk_buff *msg;
-   struct net *peer;
int err;
 
err = nlmsg_parse(nlh, sizeof(struct rtgenmsg), tb, NETNSA_MAX,
@@ -806,13 +808,27 @@ static int rtnl_net_getid(struct sk_buff *skb, struct nlmsghdr *nlh,
return PTR_ERR(peer);
}
 
+   if (tb[NETNSA_TARGET_NSID]) {
+   int id = nla_get_s32(tb[NETNSA_TARGET_NSID]);
+
+   target = rtnl_get_net_ns_capable(NETLINK_CB(skb).sk, id);
+   if (IS_ERR(target)) {
+   NL_SET_BAD_ATTR(extack, tb[NETNSA_TARGET_NSID]);
+   NL_SET_ERR_MSG(extack,
+  "Target netns reference is invalid");
+   err = PTR_ERR(target);
+   goto out;
+   }
+   put_target = true;
+   }
+
msg = nlmsg_new(rtnl_net_get_size(), GFP_KERNEL);
if (!msg) {
err = -ENOMEM;
goto out;
}
 
-   fillargs.nsid = peernet2id(net, peer);
+   fillargs.nsid = peernet2id(target, peer);
err = rtnl_net_fill(msg, &fillargs);
if (err < 0)
goto err_out;
@@ -823,15 +839,19 @@ static int rtnl_net_getid(struct sk_buff *skb, struct nlmsghdr *nlh,
 err_out:
nlmsg_free(msg);
 out:
+   if (put_target)
+   put_net(target);
put_net(peer);
return err;
 }
 
 struct rtnl_net_dump_cb {
+   struct net *tgt_net;
struct sk_buff *skb;
struct net_fill_args fillargs;
int idx;
int s_idx;
+   bool put_tgt_net;
 };
 
 static int rtnl_net_dumpid_one(int id, void *peer, void *data)
@@ -852,10 +872,50 @@ static int rtnl_net_dumpid_one(int id, void *peer, void *data)
return 0;
 }
 
+static int rtnl_valid_dump_net_req(const struct nlmsghdr *nlh, struct sock *sk,
+  struct rtnl_net_dump_cb *net_cb,
+  struct netlink_callback *cb)
+{
+   struct netlink_ext_ack *extack = cb->extack;
+   struct nlattr *tb[NETNSA_MAX + 1];
+   int err, i;
+
+   err = nlmsg_parse_strict(nlh, sizeof(struct rtgenmsg), tb, NETNSA_MAX,
+rtnl_net_policy, extack);
+   if (err < 0)
+   return err;
+
+   for (i = 0; i <= NETNSA_MAX; i++) {
+   if (!tb[i])
+   continue;
+
+   if (i == NETNSA_TARGET_NSID) {
+   struct net *net;
+
+   net = rtnl_get_net_ns_capable(sk, nla_get_s32(tb[i]));
+   if (IS_ERR(net)) {
+   NL_SET_BAD_ATTR(extack, tb[i]);
+   NL_SET_ERR_MSG(extack,
+  "Invalid target network namespace id");
+   return PTR_ERR(net);
+   }
+   net_cb->tgt_net = net;
+   net_cb->put_tgt_net = true;
+   } else {
+   NL_SET_BAD_ATTR(extack, tb[i]);
+   NL_SET_ERR_MSG(extack,
+  "Unsupported attribute in dump request");
+   return -EINVAL;
+   }
+   }
+
+   return 0;
+}
+
 static int rtnl_net_dumpid(struct sk_buff *skb, struct netlink_callback *cb)
 {
-   struct net *net = sock_net(skb->sk);
struct rtnl_net_dump_cb net_cb = {
+ 

[PATCH net-next v2 1/5] netns: remove net arg from rtnl_net_fill()

2018-11-22 Thread Nicolas Dichtel
This argument is not used anymore.

Fixes: cab3c8ec8d57 ("netns: always provide the id to rtnl_net_fill()")
Signed-off-by: Nicolas Dichtel 
---
 net/core/net_namespace.c | 10 --
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index fefe72774aeb..52b9620e3457 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -739,7 +739,7 @@ static int rtnl_net_get_size(void)
 }
 
 static int rtnl_net_fill(struct sk_buff *skb, u32 portid, u32 seq, int flags,
-int cmd, struct net *net, int nsid)
+int cmd, int nsid)
 {
struct nlmsghdr *nlh;
struct rtgenmsg *rth;
@@ -801,7 +801,7 @@ static int rtnl_net_getid(struct sk_buff *skb, struct nlmsghdr *nlh,
 
id = peernet2id(net, peer);
err = rtnl_net_fill(msg, NETLINK_CB(skb).portid, nlh->nlmsg_seq, 0,
-   RTM_NEWNSID, net, id);
+   RTM_NEWNSID, id);
if (err < 0)
goto err_out;
 
@@ -816,7 +816,6 @@ static int rtnl_net_getid(struct sk_buff *skb, struct nlmsghdr *nlh,
 }
 
 struct rtnl_net_dump_cb {
-   struct net *net;
struct sk_buff *skb;
struct netlink_callback *cb;
int idx;
@@ -833,7 +832,7 @@ static int rtnl_net_dumpid_one(int id, void *peer, void *data)
 
ret = rtnl_net_fill(net_cb->skb, NETLINK_CB(net_cb->cb->skb).portid,
net_cb->cb->nlh->nlmsg_seq, NLM_F_MULTI,
-   RTM_NEWNSID, net_cb->net, id);
+   RTM_NEWNSID, id);
if (ret < 0)
return ret;
 
@@ -846,7 +845,6 @@ static int rtnl_net_dumpid(struct sk_buff *skb, struct netlink_callback *cb)
 {
struct net *net = sock_net(skb->sk);
struct rtnl_net_dump_cb net_cb = {
-   .net = net,
.skb = skb,
.cb = cb,
.idx = 0,
@@ -876,7 +874,7 @@ static void rtnl_net_notifyid(struct net *net, int cmd, int id)
if (!msg)
goto out;
 
-   err = rtnl_net_fill(msg, 0, 0, 0, cmd, net, id);
+   err = rtnl_net_fill(msg, 0, 0, 0, cmd, id);
if (err < 0)
goto err_out;
 
-- 
2.18.0



[PATCH net-next v2 2/5] netns: introduce 'struct net_fill_args'

2018-11-22 Thread Nicolas Dichtel
This is preparatory work. To avoid passing too many arguments to the
function rtnl_net_fill(), a new structure is defined.

Signed-off-by: Nicolas Dichtel 
---
 net/core/net_namespace.c | 48 
 1 file changed, 34 insertions(+), 14 deletions(-)

diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 52b9620e3457..f8a5966b086c 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -738,20 +738,28 @@ static int rtnl_net_get_size(void)
   ;
 }
 
-static int rtnl_net_fill(struct sk_buff *skb, u32 portid, u32 seq, int flags,
-int cmd, int nsid)
+struct net_fill_args {
+   u32 portid;
+   u32 seq;
+   int flags;
+   int cmd;
+   int nsid;
+};
+
+static int rtnl_net_fill(struct sk_buff *skb, struct net_fill_args *args)
 {
struct nlmsghdr *nlh;
struct rtgenmsg *rth;
 
-   nlh = nlmsg_put(skb, portid, seq, cmd, sizeof(*rth), flags);
+   nlh = nlmsg_put(skb, args->portid, args->seq, args->cmd, sizeof(*rth),
+   args->flags);
if (!nlh)
return -EMSGSIZE;
 
rth = nlmsg_data(nlh);
rth->rtgen_family = AF_UNSPEC;
 
-   if (nla_put_s32(skb, NETNSA_NSID, nsid))
+   if (nla_put_s32(skb, NETNSA_NSID, args->nsid))
goto nla_put_failure;
 
nlmsg_end(skb, nlh);
@@ -767,10 +775,15 @@ static int rtnl_net_getid(struct sk_buff *skb, struct nlmsghdr *nlh,
 {
struct net *net = sock_net(skb->sk);
struct nlattr *tb[NETNSA_MAX + 1];
+   struct net_fill_args fillargs = {
+   .portid = NETLINK_CB(skb).portid,
+   .seq = nlh->nlmsg_seq,
+   .cmd = RTM_NEWNSID,
+   };
struct nlattr *nla;
struct sk_buff *msg;
struct net *peer;
-   int err, id;
+   int err;
 
err = nlmsg_parse(nlh, sizeof(struct rtgenmsg), tb, NETNSA_MAX,
  rtnl_net_policy, extack);
@@ -799,9 +812,8 @@ static int rtnl_net_getid(struct sk_buff *skb, struct nlmsghdr *nlh,
goto out;
}
 
-   id = peernet2id(net, peer);
-   err = rtnl_net_fill(msg, NETLINK_CB(skb).portid, nlh->nlmsg_seq, 0,
-   RTM_NEWNSID, id);
+   fillargs.nsid = peernet2id(net, peer);
+   err = rtnl_net_fill(msg, &fillargs);
if (err < 0)
goto err_out;
 
@@ -817,7 +829,7 @@ static int rtnl_net_getid(struct sk_buff *skb, struct nlmsghdr *nlh,
 
 struct rtnl_net_dump_cb {
struct sk_buff *skb;
-   struct netlink_callback *cb;
+   struct net_fill_args fillargs;
int idx;
int s_idx;
 };
@@ -830,9 +842,8 @@ static int rtnl_net_dumpid_one(int id, void *peer, void *data)
if (net_cb->idx < net_cb->s_idx)
goto cont;
 
-   ret = rtnl_net_fill(net_cb->skb, NETLINK_CB(net_cb->cb->skb).portid,
-   net_cb->cb->nlh->nlmsg_seq, NLM_F_MULTI,
-   RTM_NEWNSID, id);
+   net_cb->fillargs.nsid = id;
+   ret = rtnl_net_fill(net_cb->skb, &net_cb->fillargs);
if (ret < 0)
return ret;
 
@@ -846,7 +857,12 @@ static int rtnl_net_dumpid(struct sk_buff *skb, struct netlink_callback *cb)
struct net *net = sock_net(skb->sk);
struct rtnl_net_dump_cb net_cb = {
.skb = skb,
-   .cb = cb,
+   .fillargs = {
+   .portid = NETLINK_CB(cb->skb).portid,
+   .seq = cb->nlh->nlmsg_seq,
+   .flags = NLM_F_MULTI,
+   .cmd = RTM_NEWNSID,
+   },
.idx = 0,
.s_idx = cb->args[0],
};
@@ -867,6 +883,10 @@ static int rtnl_net_dumpid(struct sk_buff *skb, struct netlink_callback *cb)
 
 static void rtnl_net_notifyid(struct net *net, int cmd, int id)
 {
+   struct net_fill_args fillargs = {
+   .cmd = cmd,
+   .nsid = id,
+   };
struct sk_buff *msg;
int err = -ENOMEM;
 
@@ -874,7 +894,7 @@ static void rtnl_net_notifyid(struct net *net, int cmd, int id)
if (!msg)
goto out;
 
-   err = rtnl_net_fill(msg, 0, 0, 0, cmd, id);
+   err = rtnl_net_fill(msg, );
if (err < 0)
goto err_out;
 
-- 
2.18.0



[PATCH net-next v2 4/5] netns: enable to specify a nsid for a get request

2018-11-22 Thread Nicolas Dichtel
Combined with NETNSA_TARGET_NSID, it enables "translating" a nsid from one
netns to a nsid of another netns.
This is useful when using NETLINK_F_LISTEN_ALL_NSID because it helps the
user interpret a nsid received from another netns.

Signed-off-by: Nicolas Dichtel 
---
 net/core/net_namespace.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 885c54197e31..dd25fb22ad45 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -797,6 +797,11 @@ static int rtnl_net_getid(struct sk_buff *skb, struct nlmsghdr *nlh,
} else if (tb[NETNSA_FD]) {
peer = get_net_ns_by_fd(nla_get_u32(tb[NETNSA_FD]));
nla = tb[NETNSA_FD];
+   } else if (tb[NETNSA_NSID]) {
+   peer = get_net_ns_by_id(net, nla_get_u32(tb[NETNSA_NSID]));
+   if (!peer)
+   peer = ERR_PTR(-ENOENT);
+   nla = tb[NETNSA_NSID];
} else {
NL_SET_ERR_MSG(extack, "Peer netns reference is missing");
return -EINVAL;
-- 
2.18.0
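The on-the-wire shape of such a request can be sketched in userspace. The helper below is purely illustrative — the constants are copied from the patches in this thread (on a patched kernel they come from <linux/net_namespace.h> and <linux/rtnetlink.h>), and `put_s32_attr`/`build_getnsid` are hypothetical names — but it shows how NETNSA_NSID and NETNSA_TARGET_NSID sit as two s32 attributes after the rtgenmsg header:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Values copied from the patches in this thread (assumed, not from
 * installed headers): */
#define RTM_GETNSID		90
#define NETNSA_NSID		1
#define NETNSA_TARGET_NSID	4
#define NLMSG_HDR_SZ		16	/* sizeof(struct nlmsghdr) */
#define NLA_HDR_SZ		4	/* sizeof(struct nlattr) */

/* Append one s32 netlink attribute; returns bytes written (8, already
 * 4-byte aligned, so no padding is needed here). */
static int put_s32_attr(uint8_t *p, uint16_t type, int32_t value)
{
	uint16_t len = NLA_HDR_SZ + sizeof(value);

	memcpy(p, &len, 2);		/* nla_len (includes the header) */
	memcpy(p + 2, &type, 2);	/* nla_type */
	memcpy(p + NLA_HDR_SZ, &value, sizeof(value));
	return len;
}

/* Build an RTM_GETNSID request: "resolve <nsid> relative to the netns
 * that currently has <target_nsid>". Flags/seq/pid are left to the
 * caller. Returns the total message length. */
static int build_getnsid(uint8_t *buf, int32_t nsid, int32_t target_nsid)
{
	int off = NLMSG_HDR_SZ + 4;	/* header + padded struct rtgenmsg */

	memset(buf, 0, off);		/* rtgen_family = AF_UNSPEC (0) */
	off += put_s32_attr(buf + off, NETNSA_NSID, nsid);
	off += put_s32_attr(buf + off, NETNSA_TARGET_NSID, target_nsid);

	uint32_t nlmsg_len = (uint32_t)off;
	uint16_t nlmsg_type = RTM_GETNSID;

	memcpy(buf, &nlmsg_len, 4);
	memcpy(buf + 4, &nlmsg_type, 2);
	return off;
}
```

This corresponds to the `ip -n bar netns list-id target-nsid 32 nsid 31` invocation from the cover letter: the kernel answers with the nsid as seen from the target netns, plus NETNSA_CURRENT_NSID (patch 5/5) for correlation.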



[PATCH net-next v2 5/5] netns: enable to dump full nsid translation table

2018-11-22 Thread Nicolas Dichtel
Like the previous patch, the goal is to make it easier to convert nsids from
one netns to another.
A new attribute (NETNSA_CURRENT_NSID) is added to the kernel answer when
NETNSA_TARGET_NSID is provided, so the user can easily convert nsids.

Signed-off-by: Nicolas Dichtel 
---
 include/uapi/linux/net_namespace.h |  1 +
 net/core/net_namespace.c   | 30 --
 2 files changed, 25 insertions(+), 6 deletions(-)

diff --git a/include/uapi/linux/net_namespace.h b/include/uapi/linux/net_namespace.h
index 0ed9dd61d32a..9f9956809565 100644
--- a/include/uapi/linux/net_namespace.h
+++ b/include/uapi/linux/net_namespace.h
@@ -17,6 +17,7 @@ enum {
NETNSA_PID,
NETNSA_FD,
NETNSA_TARGET_NSID,
+   NETNSA_CURRENT_NSID,
__NETNSA_MAX,
 };
 
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index dd25fb22ad45..25030e0317a2 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -745,6 +745,8 @@ struct net_fill_args {
int flags;
int cmd;
int nsid;
+   bool add_ref;
+   int ref_nsid;
 };
 
 static int rtnl_net_fill(struct sk_buff *skb, struct net_fill_args *args)
@@ -763,6 +765,10 @@ static int rtnl_net_fill(struct sk_buff *skb, struct net_fill_args *args)
if (nla_put_s32(skb, NETNSA_NSID, args->nsid))
goto nla_put_failure;
 
+   if (args->add_ref &&
+   nla_put_s32(skb, NETNSA_CURRENT_NSID, args->ref_nsid))
+   goto nla_put_failure;
+
nlmsg_end(skb, nlh);
return 0;
 
@@ -782,7 +788,6 @@ static int rtnl_net_getid(struct sk_buff *skb, struct nlmsghdr *nlh,
.cmd = RTM_NEWNSID,
};
struct net *peer, *target = net;
-   bool put_target = false;
struct nlattr *nla;
struct sk_buff *msg;
int err;
@@ -824,7 +829,8 @@ static int rtnl_net_getid(struct sk_buff *skb, struct nlmsghdr *nlh,
err = PTR_ERR(target);
goto out;
}
-   put_target = true;
+   fillargs.add_ref = true;
+   fillargs.ref_nsid = peernet2id(net, peer);
}
 
msg = nlmsg_new(rtnl_net_get_size(), GFP_KERNEL);
@@ -844,7 +850,7 @@ static int rtnl_net_getid(struct sk_buff *skb, struct nlmsghdr *nlh,
 err_out:
nlmsg_free(msg);
 out:
-   if (put_target)
+   if (fillargs.add_ref)
put_net(target);
put_net(peer);
return err;
@@ -852,11 +858,11 @@ static int rtnl_net_getid(struct sk_buff *skb, struct nlmsghdr *nlh,
 
 struct rtnl_net_dump_cb {
struct net *tgt_net;
+   struct net *ref_net;
struct sk_buff *skb;
struct net_fill_args fillargs;
int idx;
int s_idx;
-   bool put_tgt_net;
 };
 
 static int rtnl_net_dumpid_one(int id, void *peer, void *data)
@@ -868,6 +874,8 @@ static int rtnl_net_dumpid_one(int id, void *peer, void *data)
goto cont;
 
net_cb->fillargs.nsid = id;
+   if (net_cb->fillargs.add_ref)
+   net_cb->fillargs.ref_nsid = __peernet2id(net_cb->ref_net, peer);
ret = rtnl_net_fill(net_cb->skb, &net_cb->fillargs);
if (ret < 0)
return ret;
@@ -904,8 +912,9 @@ static int rtnl_valid_dump_net_req(const struct nlmsghdr *nlh, struct sock *sk,
   "Invalid target network namespace id");
return PTR_ERR(net);
}
+   net_cb->fillargs.add_ref = true;
+   net_cb->ref_net = net_cb->tgt_net;
net_cb->tgt_net = net;
-   net_cb->put_tgt_net = true;
} else {
NL_SET_BAD_ATTR(extack, tb[i]);
NL_SET_ERR_MSG(extack,
@@ -940,12 +949,21 @@ static int rtnl_net_dumpid(struct sk_buff *skb, struct netlink_callback *cb)
}
 
spin_lock_bh(&net_cb.tgt_net->nsid_lock);
+   if (net_cb.fillargs.add_ref &&
+   !net_eq(net_cb.ref_net, net_cb.tgt_net) &&
+   !spin_trylock_bh(&net_cb.ref_net->nsid_lock)) {
+   err = -EAGAIN;
+   goto end;
+   }
idr_for_each(&net_cb.tgt_net->netns_ids, rtnl_net_dumpid_one, &net_cb);
+   if (net_cb.fillargs.add_ref &&
+   !net_eq(net_cb.ref_net, net_cb.tgt_net))
+   spin_unlock_bh(&net_cb.ref_net->nsid_lock);
spin_unlock_bh(&net_cb.tgt_net->nsid_lock);
 
cb->args[0] = net_cb.idx;
 end:
-   if (net_cb.put_tgt_net)
+   if (net_cb.fillargs.add_ref)
put_net(net_cb.tgt_net);
return err < 0 ? err : skb->len;
 }
-- 
2.18.0



Re: DSA support for Marvell 88e6065 switch

2018-11-22 Thread Lennert Buytenhek
On Thu, Nov 22, 2018 at 02:21:23PM +0100, Pavel Machek wrote:

> > > If I wanted it to work, what do I need to do? AFAICT phy autoprobing
> > > should just attach it as soon as it is compiled in?
> > 
> > Nope. It is a switch, not a PHY. Switches are never auto-probed
> > because they are not guaranteed to have ID registers.
> > 
> > You need to use the legacy device tree binding. Look in
> > Documentation/devicetree/bindings/net/dsa/dsa.txt, section Deprecated
> > Binding. You can get more examples if you checkout old kernels. Or
> > kirkwood-rd88f6281.dtsi, the dsa { } node which is disabled.
> 
> Thanks; I ported code from mv88e66xx in the meantime, and switch
> appears to be detected.
> 
> But I'm running into problems with tagging code, and I guess I'd like
> some help understanding.
> 
> tag_trailer: allocates new skb, then copies data around.
> 
> tag_qca: does dev->stats.tx_packets++, and reuses existing skb.
> 
> tag_brcm: reuses existing skb.
> 
> Is qca wrong in adjusting the statistics? Why does trailer allocate
> new skb?
> 
> 6065 seems to use 2-byte header between "SFD" and "Destination
> address" in the ethernet frame. That's ... strange place to put
> header, as addresses are now shifted. I need to put ethernet in
> promisc mode (by running tcpdump) to get data moving.. and can not
> figure out what to do in tag_...

Does this switch chip not also support trailer mode?

There's basically four tagging modes for Marvell switch chips: header
mode (the one you described), trailer mode (tag_trailer.c), DSA and
ethertype DSA.  The switch chips I worked on that didn't support
(ethertype) DSA tagging did support both header and trailer modes,
and I chose to run them in trailer mode for the reasons you describe
above, but if your chip doesn't support trailer mode, then yes,
you'll have to add support for header mode and put the underlying
interface into promiscuous mode and such.


[PATCH net] be2net: Fix NULL pointer dereference in be_tx_timeout()

2018-11-22 Thread Petr Oros
The driver enumerates Tx queues in the ndo_tx_timeout() handler, where a
race with be_update_queues() is possible. For this case we set carrier off,
which prevents the netdev watchdog from firing after be_clear_queues().
A watchdog timeout doesn't make sense here, as we are re-creating the queues.

Reproducer:
We can reproduce the bug with ethtool by changing the queue count:
  ethtool -L $netif combined 1
  ethtool -L $netif combined 32
If the oops is not triggered immediately, just run it again or in a loop.

Oops:
[  865.768648] NETDEV WATCHDOG: enp4s0f0 (be2net): transmit queue 0 timed out
[  865.775539] WARNING: CPU: 3 PID: 0 at net/sched/sch_generic.c:461 
dev_watchdog+0x20d/0x220
[  865.783796] Modules linked in: be2net intel_rapl sb_edac 
x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass 
crct10dif_pclmul crc32_pclmul iTCO_wdt iTCO_vendor_support ghash_clmulni_intel 
mei_me intel_cstate intel_uncore ipmi_ssif mei ipmi_si pcspkr sg i2c_i801 
joydev lpc_ich intel_rapl_perf ipmi_devintf ioatdma ipmi_msghandler xfs 
libcrc32c sd_mod mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt 
fb_sys_fops ttm ahci libahci crc32c_intel drm serio_raw libata igb dca 
i2c_algo_bit wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: be2net]
[  865.834289] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.20.0-rc3+ #2
[  865.840640] Hardware name: Supermicro X9DBU/X9DBU, BIOS 3.2 01/15/2015
[  865.847168] RIP: 0010:dev_watchdog+0x20d/0x220
[  865.851612] Code: 00 49 63 4e e0 eb 92 4c 89 e7 c6 05 a5 de c9 00 01 e8 f7 
b2 fc ff 89 d9 4c 89 e6 48 c7 c7 a0 d1 b2 99 48 89 c2 e8 7d b0 98 ff <0f> 0b eb 
c0 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 66 66 66
[  865.870358] RSP: 0018:9bee73ac3e88 EFLAGS: 00010282
[  865.875583] RAX:  RBX:  RCX: 083f
[  865.882707] RDX:  RSI: 00f6 RDI: 003f
[  865.889832] RBP: 9bee5fa0045c R08: 0824 R09: 0007
[  865.896956] R10:  R11: 9a3f162d R12: 9bee5fa0
[  865.904088] R13: 0003 R14: 9bee5fa00480 R15: 0020
[  865.911214] FS:  () GS:9bee73ac() 
knlGS:
[  865.919298] CS:  0010 DS:  ES:  CR0: 80050033
[  865.925037] CR2: 5580497ce040 CR3: 0002cf60a004 CR4: 000606e0
[  865.932170] Call Trace:
[  865.934626]  
[  865.936645]  ? pfifo_fast_dequeue+0x160/0x160
[  865.941005]  call_timer_fn+0x2b/0x130
[  865.944670]  run_timer_softirq+0x3b9/0x3f0
[  865.948768]  ? tick_sched_timer+0x37/0x70
[  865.952779]  ? __hrtimer_run_queues+0x110/0x280
[  865.957314]  __do_softirq+0xdd/0x2fe
[  865.960896]  irq_exit+0xfa/0x100
[  865.964125]  smp_apic_timer_interrupt+0x74/0x140
[  865.968745]  apic_timer_interrupt+0xf/0x20
[  865.972844]  
[  865.974953] RIP: 0010:cpuidle_enter_state+0xb0/0x320
[  865.979915] Code: 89 c3 66 66 66 66 90 31 ff e8 0c 07 a6 ff 80 7c 24 0b 00 
74 12 9c 58 f6 c4 02 0f 85 46 02 00 00 31 ff e8 33 e0 ab ff fb 85 ed <0f> 88 1a 
02 00 00 48 b8 ff ff ff ff f3 01 00 00 48 2b 1c 24 48 39
[  865.998661] RSP: 0018:bc9ac19e7ea0 EFLAGS: 0206 ORIG_RAX: 
ff13
[  866.006225] RAX: 9bee73ae1dc0 RBX: 00c9938e11ae RCX: 001f
[  866.013350] RDX: 00c9938e11ae RSI: 435e532a RDI: 
[  866.020474] RBP: 0005 R08: 0002 R09: 00021640
[  866.027598] R10: 9c434b946fde R11: 9bee73ae0e44 R12: 99d27538
[  866.034723] R13: 9bee73aec628 R14: 0005 R15: 
[  866.041860]  do_idle+0x1f1/0x230
[  866.045091]  cpu_startup_entry+0x19/0x20
[  866.049016]  start_secondary+0x195/0x1e0
[  866.052943]  secondary_startup_64+0xb6/0xc0
[  866.057129] ---[ end trace dead88c26bcd8261 ]---
[  866.061750] be2net :04:00.0: TXQ Dump: 0 H: 0 T: 0 used: 0, qid: 0x2
[  866.068452] BUG: unable to handle kernel NULL pointer dereference at 

[  866.076273] PGD 0 P4D 0
[  866.078810] Oops:  [#1] SMP PTI
[  866.082305] CPU: 3 PID: 0 Comm: swapper/3 Tainted: GW 
4.20.0-rc3+ #2
[  866.090041] Hardware name: Supermicro X9DBU/X9DBU, BIOS 3.2 01/15/2015
[  866.096566] RIP: 0010:be_tx_timeout+0x7c/0x300 [be2net]
[  866.101786] Code: 8b 45 1c 41 8b 4d 14 48 89 df 31 ed 45 8b 4d 18 48 c7 c6 
80 51 2c c0 50 45 8b 45 10 8b 54 24 14 e8 09 a7 cb d8 4d 8b 7d 20 59 <41> 8b 0c 
af 45 8b 44 af 04 41 8b 74 af 0c 45 8b 4c af 08 89 ca 44
[  866.120532] RSP: 0018:9bee73ac3e38 EFLAGS: 00010246
[  866.125758] RAX:  RBX: 9bee72d6b0b0 RCX: 0002
[  866.132882] RDX:  RSI: 00f6 RDI: 003f
[  866.140014] RBP:  R08: 084d R09: 0007
[  866.147138] R10:  R11: 9a3f162d R12: c02c60ab
[  866.154263] R13: 9bee5fa04b40 R14: c02c613a R15: 
[  866.161388] FS:  

[PATCH v3 0/4] Fix unsafe BPF_PROG_TEST_RUN interface

2018-11-22 Thread Lorenz Bauer
Right now, there is no safe way to use BPF_PROG_TEST_RUN with data_out.
This is because bpf_test_finish copies the output buffer to user space
without checking its size. This can lead to the kernel overwriting
user-space memory past the end of the buffer when xdp_adjust_head and
friends grow the packet.
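As a rough user-space illustration of that failure mode (a sketch under
assumptions: unchecked_copy_out is a made-up stand-in for the old
bpf_test_finish copy, not the kernel code), the caller reserves 5 bytes,
but the copy uses the program's grown output length, so bytes land past
the end of what the caller reserved. The backing array is deliberately
oversized here so the demo only clobbers a canary byte:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for the old, unchecked bpf_test_finish copy: it trusts the
 * program's output length and ignores the caller's buffer capacity.
 * Purely illustrative user-space code, not the kernel implementation. */
static void unchecked_copy_out(uint8_t *user_buf, const uint8_t *out,
			       uint32_t out_len)
{
	memcpy(user_buf, out, out_len); /* no capacity check at all */
}
```

With an 8-byte program output and a buffer the caller believes holds 5
bytes, the three bytes past the buffer are silently overwritten.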

Changes in v3:
* Introduce bpf_prog_test_run_xattr instead of modifying the existing
  function

Changes in v2:
* Make the syscall return ENOSPC if data_size_out is too small
* Make bpf_prog_test_run return EINVAL if size_out is missing
* Document the new behaviour of data_size_out
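The behaviour listed in the changelogs above can be sketched as a small
user-space model (the name model_test_run_copyout and its signature are
invented for illustration; the real logic lives in net/bpf/test_run.c):
a data_out buffer without a size hint is rejected with EINVAL, a short
buffer gets a truncated copy plus ENOSPC, and the reported size is
always the program's real output length.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Illustrative model of the documented semantics, not kernel code. */
static int model_test_run_copyout(const uint8_t *out, uint32_t out_len,
				  uint8_t *data_out, const uint32_t *size_in,
				  uint32_t *size_out)
{
	uint32_t cap;

	if (data_out && !size_in)
		return -EINVAL;        /* size hint is now mandatory */
	if (!data_out)
		return 0;              /* caller did not ask for output */

	cap = *size_in;
	memcpy(data_out, out, out_len < cap ? out_len : cap);
	*size_out = out_len;           /* real length, even when truncated */
	return out_len > cap ? -ENOSPC : 0;
}
```

A caller can then detect truncation from ENOSPC and retry with a buffer
of the reported size.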

Lorenz Bauer (4):
  bpf: respect size hint to BPF_PROG_TEST_RUN if present
  tools: sync uapi/linux/bpf.h
  libbpf: add bpf_prog_test_run_xattr
  selftests: add a test for bpf_prog_test_run_xattr

 include/uapi/linux/bpf.h |  7 +++-
 net/bpf/test_run.c   | 15 +++-
 tools/include/uapi/linux/bpf.h   |  7 +++-
 tools/lib/bpf/bpf.c  | 27 +
 tools/lib/bpf/bpf.h  | 13 +++
 tools/testing/selftests/bpf/test_progs.c | 49 
 6 files changed, 112 insertions(+), 6 deletions(-)

-- 
2.17.1



[PATCH v3 4/4] selftests: add a test for bpf_prog_test_run_xattr

2018-11-22 Thread Lorenz Bauer
Make sure that bpf_prog_test_run_xattr returns the correct length
and that the kernel respects the output size hint. Also check
that errno indicates ENOSPC if a short output buffer is given.

Signed-off-by: Lorenz Bauer 
---
 tools/testing/selftests/bpf/test_progs.c | 49 
 1 file changed, 49 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_progs.c 
b/tools/testing/selftests/bpf/test_progs.c
index c1e688f61061..f9f5b1dbcc83 100644
--- a/tools/testing/selftests/bpf/test_progs.c
+++ b/tools/testing/selftests/bpf/test_progs.c
@@ -124,6 +124,54 @@ static void test_pkt_access(void)
bpf_object__close(obj);
 }
 
+static void test_prog_run_xattr(void)
+{
+   const char *file = "./test_pkt_access.o";
+   __u32 duration, retval, size_out;
+   struct bpf_object *obj;
+   char buf[10];
+   int err;
+   struct bpf_prog_test_run_attr tattr = {
+   .repeat = 1,
+   .data = &pkt_v4,
+   .size = sizeof(pkt_v4),
+   .data_out = buf,
+   .size_out = 5,
+   };
+
+   err = bpf_prog_load(file, BPF_PROG_TYPE_SCHED_CLS, &obj, &tattr.prog_fd);
+   if (CHECK(err, "load", "err %d errno %d\n", err, errno))
+   return;
+
+   memset(buf, 0, sizeof(buf));
+
+   err = bpf_prog_test_run_xattr(&tattr, &size_out, &retval, &duration);
+   CHECK(err != -1 || errno != ENOSPC || retval, "run",
+ "err %d errno %d retval %d\n", err, errno, retval);
+
+   CHECK(size_out != sizeof(pkt_v4), "output_size",
+ "incorrect output size, want %lu have %u\n",
+ sizeof(pkt_v4), size_out);
+
+   CHECK(buf[5] != 0, "overflow",
+ "BPF_PROG_TEST_RUN ignored size hint\n");
+
+   tattr.data_out = NULL;
+   tattr.size_out = 0;
+   errno = 0;
+
+   err = bpf_prog_test_run_xattr(&tattr, NULL, &retval, &duration);
+   CHECK(err || errno || retval, "run_no_output",
+ "err %d errno %d retval %d\n", err, errno, retval);
+
+   tattr.size_out = 1;
+   err = bpf_prog_test_run_xattr(&tattr, NULL, NULL, &duration);
+   CHECK(err != -EINVAL, "run_wrong_size_out", "err %d\n", err);
+
+   bpf_object__close(obj);
+}
+
 static void test_xdp(void)
 {
struct vip key4 = {.protocol = 6, .family = AF_INET};
@@ -1837,6 +1885,7 @@ int main(void)
jit_enabled = is_jit_enabled();
 
test_pkt_access();
+   test_prog_run_xattr();
test_xdp();
test_xdp_adjust_tail();
test_l4lb_all();
-- 
2.17.1


