Re: [PATCH][xfrm-next] xfrm6: remove BUG_ON from xfrm6_dst_ifdown
On Mon, Nov 12, 2018 at 05:28:22PM +0800, Li RongQing wrote:
> if loopback_idev is a NULL pointer, the following access of
> loopback_idev will trigger a panic, which is the same as BUG_ON
>
> Signed-off-by: Li RongQing

Patch applied, thanks!
Suggesting patch for tcp_close
Dear all,

This is Soukin Bae, working in Samsung Electronics Mobile Division. We have a problem with TCP closing. In short:

1. During the 4-way handshake to close a session,
2. if the ACK packet never arrives from the opposite side,
3. then the session can never be closed.

On a mobile device, condition 2 can happen in various cases, such as turning Wi-Fi or mobile data off, bad air-network conditions, etc. This can occur on either side of the connection: when the issue happens during an active close, the session remains in FIN_WAIT1 state; during a passive close, it remains in LAST_ACK state.

Below is a test result after repeated Wi-Fi on/off cycles (without mobile data). Presumably the 'Foreign Address' peer sent its FIN-ACK while Wi-Fi was off, so the device couldn't receive any further ACK packets, and the sessions remain permanently. Their count keeps growing; this is a resource leak.

### turn on wifi
D:\Test>adb shell netstat -npWae
Proto Recv-Q Send-Q Local Address                                Foreign Address               State        User   Inode  PID/Program Name
tcp    0      0     127.0.0.1:5037                               0.0.0.0:*                     LISTEN       0      36357  6907/adbd
tcp6   0      0     2001:2d8:ed1c:de1c:bd94:fe5:2d9a:d8e4:58660  2404:6800:4008:c00::bc:5228   ESTABLISHED  10041  74347  6523/com.google.android.gms.persistent
tcp6   0      0     2001:2d8:ed1c:de1c:bd94:fe5:2d9a:d8e4:35148  2404:6800:4004:800::2003:80   LAST_ACK     0      0      -
tcp6   0      0     2001:2d8:ed1c:de1c:bd94:fe5:2d9a:d8e4:37512  64:ff9b::3444:f3dc:443        ESTABLISHED  10137  77447  9522/com.samsung.android.game.gos
tcp6   0      0     2001:2d8:ed1c:de1c:bd94:fe5:2d9a:d8e4:49294  2404:6800:4005:80c::2004:443  LAST_ACK     0      0      -
tcp6   1      0     2001:2d8:ed1c:de1c:bd94:fe5:2d9a:d8e4:35260  64:ff9b::34d0:9421:80         LAST_ACK     0      0      -

### turn off wifi
D:\Test>adb shell netstat -npWae
Proto Recv-Q Send-Q Local Address                                Foreign Address               State        User   Inode  PID/Program Name
tcp    0      0     127.0.0.1:5037                               0.0.0.0:*                     LISTEN       0      36357  6907/adbd
tcp6   0      0     2001:2d8:ed1c:de1c:bd94:fe5:2d9a:d8e4:35148  2404:6800:4004:800::2003:80   LAST_ACK     0      0      -
tcp6   0      0     2001:2d8:ed1c:de1c:bd94:fe5:2d9a:d8e4:49294  2404:6800:4005:80c::2004:443  LAST_ACK     0      0      -
tcp6   1      0     2001:2d8:ed1c:de1c:bd94:fe5:2d9a:d8e4:35260  64:ff9b::34d0:9421:80         LAST_ACK     0      0      -

This is our analysis. When the app finished using the socket (TCP session), it called sock_close(). tcp_close() then moved sk->sk_state to LAST_ACK, and the sock to SOCK_DEAD by executing sock_orphan():

11-23 11:40:55.676 [5: Thread-44:11210] TCP: bsj: tcp_set_state: TCP sk=ffc8a789c640, in:80092, State Close Wait -> Last ACK, [2404:6800:4004:800::2003]
11-23 11:40:55.676 [5: Thread-44:11210] Call trace:
11-23 11:40:55.676 [5: Thread-44:11210] [] tcp_set_state+0x1b8/0x1f0
11-23 11:40:55.676 [5: Thread-44:11210] [] tcp_close+0x484/0x534
11-23 11:40:55.676 [5: Thread-44:11210] [] inet_release+0x60/0x74
11-23 11:40:55.676 [5: Thread-44:11210] [] inet6_release+0x30/0x48
11-23 11:40:55.676 [5: Thread-44:11210] [] __sock_release+0x40/0x104
11-23 11:40:55.676 [5: Thread-44:11210] [] sock_close+0x18/0x28
11-23 11:40:55.678 [5: Thread-44:11210] TCP: bsj: sock_orphan: TCP sk=ffc8a789c640, in:80092, State Last ACK, [2404:6800:4004:800::2003]

At this point, if the FIN-ACK comes, there is no problem; all is well. But without it, when Wi-Fi is turned off, netd tries to close all the sessions via sock_diag_destroy(), which calls tcp_abort():

11-23 11:41:38.463 [4: netd: 5323] TCP: bsj: tcp_abort: SOCK_DEAD!!! : TCP sk=ffc8a789c640, in:0, State Last ACK, caller: , [2404:6800:4004:800::2003]
11-23 11:41:38.464 [4: netd: 5323] TCP: bsj: tcp_abort: SOCK_DEAD!!! : TCP sk=ffc8a789b840, in:0, State Last ACK, caller: , [2404:6800:4005:80c::2004]
11-23 11:41:38.464 [4: netd: 5323] TCP: bsj: tcp_abort: SOCK_DEAD!!! : TCP sk=ffc8a7899c40, in:0, State Last ACK, caller: , [64:ff9b::34d0:9421]

But because the sock had already been changed to the SOCK_DEAD state by tcp_close(), tcp_done() can't be executed, so these sessions can't be closed:

int tcp_abort(struct sock *sk, int err)
{
	...
	if (!sock_flag(sk, SOCK_DEAD)) {	/* when SOCK_DEAD, tcp_done() is skipped */
		...
		sk->sk_error_report(sk);
		if (tcp_need_reset(sk->sk_state))
[PATCH net-next 0/4] qed* enhancements series
From: Sudarsana Reddy Kalluru

The patch series adds a few enhancements to the qed/qede drivers. Please consider applying it to "net-next".

Sudarsana Reddy Kalluru (4):
  qed: Display port_id in the UFP debug messages.
  qede: Simplify the usage of qede-flags.
  qede: Update link status only when interface is ready.
  qed: Add support for MBI upgrade over MFW.

 drivers/net/ethernet/qlogic/qed/qed_hsi.h    |  6 +++
 drivers/net/ethernet/qlogic/qed/qed_main.c   | 13 +-
 drivers/net/ethernet/qlogic/qed/qed_mcp.c    | 65 +++-
 drivers/net/ethernet/qlogic/qed/qed_mcp.h    | 10 -
 drivers/net/ethernet/qlogic/qede/qede.h      | 12 +++--
 drivers/net/ethernet/qlogic/qede/qede_main.c | 10 +++--
 drivers/net/ethernet/qlogic/qede/qede_ptp.c  |  6 +-
 7 files changed, 71 insertions(+), 51 deletions(-)

-- 
1.8.3.1
[PATCH net-next 4/4] qed: Add support for MBI upgrade over MFW.
The patch adds driver support for MBI image update through MFW.

Signed-off-by: Sudarsana Reddy Kalluru
Signed-off-by: Ariel Elior
Signed-off-by: Michal Kalderon
---
 drivers/net/ethernet/qlogic/qed/qed_hsi.h  |  6 
 drivers/net/ethernet/qlogic/qed/qed_main.c | 13 +++--
 drivers/net/ethernet/qlogic/qed/qed_mcp.c  | 45 +++---
 drivers/net/ethernet/qlogic/qed/qed_mcp.h  | 10 ---
 4 files changed, 40 insertions(+), 34 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_hsi.h b/drivers/net/ethernet/qlogic/qed/qed_hsi.h
index 5c221eb..7e120b5 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_hsi.h
+++ b/drivers/net/ethernet/qlogic/qed/qed_hsi.h
@@ -12655,6 +12655,7 @@ struct public_drv_mb {
 #define DRV_MB_PARAM_DCBX_NOTIFY_MASK		0x00FF
 #define DRV_MB_PARAM_DCBX_NOTIFY_SHIFT		3
 
+#define DRV_MB_PARAM_NVM_PUT_FILE_BEGIN_MBI	0x3
 #define DRV_MB_PARAM_NVM_LEN_OFFSET		24
 
 #define DRV_MB_PARAM_CFG_VF_MSIX_VF_ID_SHIFT	0
@@ -12814,6 +12815,11 @@ struct public_drv_mb {
 	union drv_union_data union_data;
 };
 
+#define FW_MB_PARAM_NVM_PUT_FILE_REQ_OFFSET_MASK	0x00ff
+#define FW_MB_PARAM_NVM_PUT_FILE_REQ_OFFSET_SHIFT	0
+#define FW_MB_PARAM_NVM_PUT_FILE_REQ_SIZE_MASK		0xff00
+#define FW_MB_PARAM_NVM_PUT_FILE_REQ_SIZE_SHIFT		24
+
 enum MFW_DRV_MSG_TYPE {
 	MFW_DRV_MSG_LINK_CHANGE,
 	MFW_DRV_MSG_FLR_FW_ACK_FAILED,
diff --git a/drivers/net/ethernet/qlogic/qed/qed_main.c b/drivers/net/ethernet/qlogic/qed/qed_main.c
index fff7f04..4b3e682 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_main.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_main.c
@@ -1939,21 +1939,30 @@ static int qed_nvm_flash_image_access(struct qed_dev *cdev, const u8 **data,
  *  0B  |                0x3 [command index]                |
  *  4B  | b'0: check_response?  | b'1-31  reserved          |
  *  8B  | File-type | reserved                              |
+ * 12B  | Image length in bytes                             |
  *      \---------------------------------------------------/
  * Start a new file of the provided type
  */
 static int qed_nvm_flash_image_file_start(struct qed_dev *cdev,
 					  const u8 **data, bool *check_resp)
 {
+	u32 file_type, file_size = 0;
 	int rc;
 
 	*data += 4;
 	*check_resp = !!(**data & BIT(0));
 	*data += 4;
+	file_type = **data;
 
 	DP_VERBOSE(cdev, NETIF_MSG_DRV,
-		   "About to start a new file of type %02x\n", **data);
-	rc = qed_mcp_nvm_put_file_begin(cdev, **data);
+		   "About to start a new file of type %02x\n", file_type);
+	if (file_type == DRV_MB_PARAM_NVM_PUT_FILE_BEGIN_MBI) {
+		*data += 4;
+		file_size = *((u32 *)(*data));
+	}
+
+	rc = qed_mcp_nvm_write(cdev, QED_PUT_FILE_BEGIN, file_type,
+			       (u8 *)(&file_size), 4);
 	*data += 4;
 
 	return rc;
diff --git a/drivers/net/ethernet/qlogic/qed/qed_mcp.c b/drivers/net/ethernet/qlogic/qed/qed_mcp.c
index 34ed757..e7f18e3 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_mcp.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_mcp.c
@@ -2745,24 +2745,6 @@ int qed_mcp_nvm_resp(struct qed_dev *cdev, u8 *p_buf)
 	return 0;
 }
 
-int qed_mcp_nvm_put_file_begin(struct qed_dev *cdev, u32 addr)
-{
-	struct qed_hwfn *p_hwfn = QED_LEADING_HWFN(cdev);
-	struct qed_ptt *p_ptt;
-	u32 resp, param;
-	int rc;
-
-	p_ptt = qed_ptt_acquire(p_hwfn);
-	if (!p_ptt)
-		return -EBUSY;
-	rc = qed_mcp_cmd(p_hwfn, p_ptt, DRV_MSG_CODE_NVM_PUT_FILE_BEGIN, addr,
-			 &resp, &param);
-	cdev->mcp_nvm_resp = resp;
-	qed_ptt_release(p_hwfn, p_ptt);
-
-	return rc;
-}
-
 int qed_mcp_nvm_write(struct qed_dev *cdev,
 		      u32 cmd, u32 addr, u8 *p_buf, u32 len)
 {
@@ -2776,6 +2758,9 @@ int qed_mcp_nvm_write(struct qed_dev *cdev,
 		return -EBUSY;
 
 	switch (cmd) {
+	case QED_PUT_FILE_BEGIN:
+		nvm_cmd = DRV_MSG_CODE_NVM_PUT_FILE_BEGIN;
+		break;
 	case QED_PUT_FILE_DATA:
 		nvm_cmd = DRV_MSG_CODE_NVM_PUT_FILE_DATA;
 		break;
@@ -2788,10 +2773,14 @@ int qed_mcp_nvm_write(struct qed_dev *cdev,
 		goto out;
 	}
 
+	buf_size = min_t(u32, (len - buf_idx), MCP_DRV_NVM_BUF_LEN);
 	while (buf_idx < len) {
-		buf_size = min_t(u32, (len - buf_idx), MCP_DRV_NVM_BUF_LEN);
-		nvm_offset = ((buf_size << DRV_MB_PARAM_NVM_LEN_OFFSET) |
-			      addr) + buf_idx;
+		if (cmd == QED_PUT_FILE_BEGIN)
+			nvm_offset = addr;
+		else
+			nvm_offset = ((buf_size <<
[PATCH net-next 3/4] qede: Update link status only when interface is ready.
In the case of an internal reload (e.g., an MTU change), there could be a race between the link-up notification from MFW and the driver unload processing. In such a case the kernel assumes the link is up and starts using the queues, which leads to a server crash. Send the link notification to the kernel only when the driver has already requested the link from MFW.

Signed-off-by: Sudarsana Reddy Kalluru
Signed-off-by: Ariel Elior
Signed-off-by: Michal Kalderon
---
 drivers/net/ethernet/qlogic/qede/qede.h      | 1 +
 drivers/net/ethernet/qlogic/qede/qede_main.c | 8 ++--
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qede/qede.h b/drivers/net/ethernet/qlogic/qede/qede.h
index f8ced12..8c0fe59 100644
--- a/drivers/net/ethernet/qlogic/qede/qede.h
+++ b/drivers/net/ethernet/qlogic/qede/qede.h
@@ -170,6 +170,7 @@ struct qede_rdma_dev {
 enum qede_flags_bit {
 	QEDE_FLAGS_IS_VF = 0,
+	QEDE_FLAGS_LINK_REQUESTED,
 	QEDE_FLAGS_PTP_TX_IN_PRORGESS,
 	QEDE_FLAGS_TX_TIMESTAMPING_EN
 };
diff --git a/drivers/net/ethernet/qlogic/qede/qede_main.c b/drivers/net/ethernet/qlogic/qede/qede_main.c
index 0f1c480..efbb4f3 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_main.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_main.c
@@ -2057,6 +2057,8 @@ static void qede_unload(struct qede_dev *edev, enum qede_unload_mode mode,
 	if (!is_locked)
 		__qede_lock(edev);
 
+	clear_bit(QEDE_FLAGS_LINK_REQUESTED, &edev->flags);
+
 	edev->state = QEDE_STATE_CLOSED;
 
 	qede_rdma_dev_event_close(edev);
@@ -2163,6 +2165,8 @@ static int qede_load(struct qede_dev *edev, enum qede_load_mode mode,
 	/* Program un-configured VLANs */
 	qede_configure_vlan_filters(edev);
 
+	set_bit(QEDE_FLAGS_LINK_REQUESTED, &edev->flags);
+
 	/* Ask for link-up using current configuration */
 	memset(&link_params, 0, sizeof(link_params));
 	link_params.link_up = true;
@@ -2258,8 +2262,8 @@ static void qede_link_update(void *dev, struct qed_link_output *link)
 {
 	struct qede_dev *edev = dev;
 
-	if (!netif_running(edev->ndev)) {
-		DP_VERBOSE(edev, NETIF_MSG_LINK, "Interface is not running\n");
+	if (!test_bit(QEDE_FLAGS_LINK_REQUESTED, &edev->flags)) {
+		DP_VERBOSE(edev, NETIF_MSG_LINK, "Interface is not ready\n");
 		return;
 	}
-- 
1.8.3.1
[PATCH net-next 2/4] qede: Simplify the usage of qede-flags.
The values represented by qede->flags are being used in mixed ways:
1. As a 'value' in some places, e.g., the QEDE_FLAGS_IS_VF usage.
2. As a bit-mask (bit number) in other places, e.g., the QEDE_FLAGS_PTP_TX_IN_PRORGESS usage.

This implementation poses problems in the future when we want to add more flag values, e.g., overlap of the values or overflow of the 64-bit storage. Updated the implementation to go with approach (2) for qede->flags.

Signed-off-by: Sudarsana Reddy Kalluru
Signed-off-by: Ariel Elior
Signed-off-by: Michal Kalderon
---
 drivers/net/ethernet/qlogic/qede/qede.h      | 11 +++
 drivers/net/ethernet/qlogic/qede/qede_main.c |  2 +-
 drivers/net/ethernet/qlogic/qede/qede_ptp.c  |  6 +++---
 3 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qede/qede.h b/drivers/net/ethernet/qlogic/qede/qede.h
index de98a97..f8ced12 100644
--- a/drivers/net/ethernet/qlogic/qede/qede.h
+++ b/drivers/net/ethernet/qlogic/qede/qede.h
@@ -168,6 +168,12 @@ struct qede_rdma_dev {
 
 #define QEDE_RFS_MAX_FLTR	256
 
+enum qede_flags_bit {
+	QEDE_FLAGS_IS_VF = 0,
+	QEDE_FLAGS_PTP_TX_IN_PRORGESS,
+	QEDE_FLAGS_TX_TIMESTAMPING_EN
+};
+
 struct qede_dev {
 	struct qed_dev			*cdev;
 	struct net_device		*ndev;
@@ -177,10 +183,7 @@ struct qede_dev {
 	u8				dp_level;
 
 	unsigned long			flags;
-#define QEDE_FLAG_IS_VF			BIT(0)
-#define IS_VF(edev)			(!!((edev)->flags & QEDE_FLAG_IS_VF))
-#define QEDE_TX_TIMESTAMPING_EN		BIT(1)
-#define QEDE_FLAGS_PTP_TX_IN_PRORGESS	BIT(2)
+#define IS_VF(edev)	(test_bit(QEDE_FLAGS_IS_VF, &(edev)->flags))
 
 	const struct qed_eth_ops	*ops;
 	struct qede_ptp			*ptp;
diff --git a/drivers/net/ethernet/qlogic/qede/qede_main.c b/drivers/net/ethernet/qlogic/qede/qede_main.c
index 46d0f2e..0f1c480 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_main.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_main.c
@@ -1086,7 +1086,7 @@ static int __qede_probe(struct pci_dev *pdev, u32 dp_module, u8 dp_level,
 	}
 
 	if (is_vf)
-		edev->flags |= QEDE_FLAG_IS_VF;
+		set_bit(QEDE_FLAGS_IS_VF, &edev->flags);
 
 	qede_init_ndev(edev);
 
diff --git a/drivers/net/ethernet/qlogic/qede/qede_ptp.c b/drivers/net/ethernet/qlogic/qede/qede_ptp.c
index 013ff56..5f3f42a 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_ptp.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_ptp.c
@@ -223,12 +223,12 @@ static int qede_ptp_cfg_filters(struct qede_dev *edev)
 
 	switch (ptp->tx_type) {
 	case HWTSTAMP_TX_ON:
-		edev->flags |= QEDE_TX_TIMESTAMPING_EN;
+		set_bit(QEDE_FLAGS_TX_TIMESTAMPING_EN, &edev->flags);
 		tx_type = QED_PTP_HWTSTAMP_TX_ON;
 		break;
 
 	case HWTSTAMP_TX_OFF:
-		edev->flags &= ~QEDE_TX_TIMESTAMPING_EN;
+		clear_bit(QEDE_FLAGS_TX_TIMESTAMPING_EN, &edev->flags);
 		tx_type = QED_PTP_HWTSTAMP_TX_OFF;
 		break;
 
@@ -518,7 +518,7 @@ void qede_ptp_tx_ts(struct qede_dev *edev, struct sk_buff *skb)
 	if (test_and_set_bit_lock(QEDE_FLAGS_PTP_TX_IN_PRORGESS, &edev->flags))
 		return;
 
-	if (unlikely(!(edev->flags & QEDE_TX_TIMESTAMPING_EN))) {
+	if (unlikely(!test_bit(QEDE_FLAGS_TX_TIMESTAMPING_EN, &edev->flags))) {
 		DP_NOTICE(edev,
 			  "Tx timestamping was not enabled, this packet will not be timestamped\n");
 	} else if (unlikely(ptp->tx_skb)) {
-- 
1.8.3.1
[PATCH net-next 1/4] qed: Display port_id in the UFP debug messages.
MFW sends UFP notifications mostly during the device init phase and PFs might not have been assigned a name by this time. Hence capturing the port-id in the debug messages helps in finding which PF the UFP notification was sent to. Also fixed a minor semantic issue in a debug print.

Signed-off-by: Sudarsana Reddy Kalluru
Signed-off-by: Ariel Elior
Signed-off-by: Michal Kalderon
---
 drivers/net/ethernet/qlogic/qed/qed_mcp.c | 20 +---
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_mcp.c b/drivers/net/ethernet/qlogic/qed/qed_mcp.c
index a96364d..34ed757 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_mcp.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_mcp.c
@@ -1619,7 +1619,7 @@ static void qed_mcp_update_stag(struct qed_hwfn *p_hwfn, struct qed_ptt *p_ptt)
 		qed_sp_pf_update_stag(p_hwfn);
 	}
 
-	DP_VERBOSE(p_hwfn, QED_MSG_SP, "ovlan = %d hw_mode = 0x%x\n",
+	DP_VERBOSE(p_hwfn, QED_MSG_SP, "ovlan = %d hw_mode = 0x%x\n",
 		   p_hwfn->mcp_info->func_info.ovlan, p_hwfn->hw_info.hw_mode);
 
 	/* Acknowledge the MFW */
@@ -1641,7 +1641,9 @@ void qed_mcp_read_ufp_config(struct qed_hwfn *p_hwfn, struct qed_ptt *p_ptt)
 	val = (port_cfg & OEM_CFG_CHANNEL_TYPE_MASK) >> OEM_CFG_CHANNEL_TYPE_OFFSET;
 	if (val != OEM_CFG_CHANNEL_TYPE_STAGGED)
-		DP_NOTICE(p_hwfn, "Incorrect UFP Channel type %d\n", val);
+		DP_NOTICE(p_hwfn,
+			  "Incorrect UFP Channel type %d port_id 0x%02x\n",
+			  val, MFW_PORT(p_hwfn));
 
 	val = (port_cfg & OEM_CFG_SCHED_TYPE_MASK) >> OEM_CFG_SCHED_TYPE_OFFSET;
 	if (val == OEM_CFG_SCHED_TYPE_ETS) {
@@ -1650,7 +1652,9 @@ void qed_mcp_read_ufp_config(struct qed_hwfn *p_hwfn, struct qed_ptt *p_ptt)
 		p_hwfn->ufp_info.mode = QED_UFP_MODE_VNIC_BW;
 	} else {
 		p_hwfn->ufp_info.mode = QED_UFP_MODE_UNKNOWN;
-		DP_NOTICE(p_hwfn, "Unknown UFP scheduling mode %d\n", val);
+		DP_NOTICE(p_hwfn,
+			  "Unknown UFP scheduling mode %d port_id 0x%02x\n",
+			  val, MFW_PORT(p_hwfn));
 	}
 
 	qed_mcp_get_shmem_func(p_hwfn, p_ptt, &shmem_info, MCP_PF_ID(p_hwfn));
@@ -1665,13 +1669,15 @@ void qed_mcp_read_ufp_config(struct qed_hwfn *p_hwfn, struct qed_ptt *p_ptt)
 		p_hwfn->ufp_info.pri_type = QED_UFP_PRI_OS;
 	} else {
 		p_hwfn->ufp_info.pri_type = QED_UFP_PRI_UNKNOWN;
-		DP_NOTICE(p_hwfn, "Unknown Host priority control %d\n", val);
+		DP_NOTICE(p_hwfn,
+			  "Unknown Host priority control %d port_id 0x%02x\n",
+			  val, MFW_PORT(p_hwfn));
 	}
 
 	DP_NOTICE(p_hwfn,
-		  "UFP shmem config: mode = %d tc = %d pri_type = %d\n",
-		  p_hwfn->ufp_info.mode,
-		  p_hwfn->ufp_info.tc, p_hwfn->ufp_info.pri_type);
+		  "UFP shmem config: mode = %d tc = %d pri_type = %d port_id 0x%02x\n",
+		  p_hwfn->ufp_info.mode, p_hwfn->ufp_info.tc,
+		  p_hwfn->ufp_info.pri_type, MFW_PORT(p_hwfn));
 }
 
 static int
-- 
1.8.3.1
Re: [PATCH][net-next] vhost:net: allocate 32KB memory instead of 32K pages when page frag refill
On 2018/11/23 10:04 AM, Jason Wang wrote:
>> when the page frag refills, 32K pages (128MB of memory) are asked for;
>> this hardly succeeds when the system is under memory stress
> Looking at get_order(), it seems we get 3 after get_order(32768) since it
> accepts the size of the block.

You are right, I understood it wrongly. Please drop this patch, sorry for the noise.

-Q
Re: [PATCH] net: fix the per task frag allocator size
> get_order(8) returns zero here if I understood it correctly.

You are right, I understood it wrongly. Please drop this patch, sorry for the noise.

-Q
Re: [PATCH][net-next] vhost:net: allocate 32KB memory instead of 32K pages when page frag refill
On 2018/11/23 10:04 AM, Li RongQing wrote:
> when the page frag refills, 32K pages (128MB of memory) are asked for;
> this hardly succeeds when the system is under memory stress

Looking at get_order(), it seems we get 3 after get_order(32768) since it accepts the size of the block:

/**
 * get_order - Determine the allocation order of a memory size
 * @size: The size for which to get the order
...
#define get_order(n)						\
(								\
	__builtin_constant_p(n) ? (				\
		((n) == 0UL) ? BITS_PER_LONG - PAGE_SHIFT :	\
		(((n) < (1UL << PAGE_SHIFT)) ? 0 :		\
		 ilog2((n) - 1) - PAGE_SHIFT + 1)		\
		 ^^^
	) :							\
	__get_order(n)						\
)

> And such a large memory size will cause an underflow of the reference
> bias and make the page refcount chaotic, since the reference bias will
> be decreased to negative before the allocated memory is used up

Do you have a reproducer for this issue?

Thanks

> so 32KB of memory is the safe choice; meanwhile, remove an unnecessary check
>
> Fixes: e4dab1e6ea64 ("vhost_net: mitigate page reference counting during page frag refill")
> Signed-off-by: Zhang Yu
> Signed-off-by: Li RongQing
> ---
>  drivers/vhost/net.c | 22 +++---
>  1 file changed, 11 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index d919284f103b..b933a4a8e4ba 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -641,7 +641,7 @@ static bool tx_can_batch(struct vhost_virtqueue *vq, size_t total_len)
> 	       !vhost_vq_avail_empty(vq->dev, vq);
>  }
>
> -#define SKB_FRAG_PAGE_ORDER	get_order(32768)
> +#define SKB_FRAG_PAGE_ORDER	3
>
>  static bool vhost_net_page_frag_refill(struct vhost_net *net, unsigned int sz,
> 					struct page_frag *pfrag, gfp_t gfp)
> @@ -654,17 +654,17 @@ static bool vhost_net_page_frag_refill(struct vhost_net *net, unsigned int sz,
> 	pfrag->offset = 0;
> 	net->refcnt_bias = 0;
> -	if (SKB_FRAG_PAGE_ORDER) {
> -		/* Avoid direct reclaim but allow kswapd to wake */
> -		pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
> -					  __GFP_COMP | __GFP_NOWARN |
> -					  __GFP_NORETRY,
> -					  SKB_FRAG_PAGE_ORDER);
> -		if (likely(pfrag->page)) {
> -			pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER;
> -			goto done;
> -		}
> +
> +	/* Avoid direct reclaim but allow kswapd to wake */
> +	pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
> +				  __GFP_COMP | __GFP_NOWARN |
> +				  __GFP_NORETRY,
> +				  SKB_FRAG_PAGE_ORDER);
> +	if (likely(pfrag->page)) {
> +		pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER;
> +		goto done;
> 	}
> +
> 	pfrag->page = alloc_page(gfp);
> 	if (likely(pfrag->page)) {
> 		pfrag->size = PAGE_SIZE;
Re: [PATCH] net: fix the per task frag allocator size
On 2018/11/23 10:03, Li RongQing wrote:
> when filling the task frag, 32K pages (128MB of memory) are asked for;
> this hardly succeeds when the system is under memory stress
>
> and commit 5640f7685831 ("net: use a per task frag allocator")
> said it wants 32768 bytes, not 32768 pages:
>
> "(up to 32768 bytes per frag, thats order-3 pages on x86)"
>
> Fixes: 5640f7685831e ("net: use a per task frag allocator")
> Signed-off-by: Zhang Yu
> Signed-off-by: Li RongQing
> ---
>  net/core/sock.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 6d7e189e3cd9..e3cbefeedf5c 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -2186,8 +2186,8 @@ static void sk_leave_memory_pressure(struct sock *sk)
> 	}
>  }
>
> -/* On 32bit arches, an skb frag is limited to 2^15 */
> -#define SKB_FRAG_PAGE_ORDER	get_order(32768)
> +/* On 32bit arches, an skb frag is limited to 2^15 bytes */
> +#define SKB_FRAG_PAGE_ORDER	get_order(8)

get_order(8) returns zero here if I understood it correctly.

>
>  /**
>   * skb_page_frag_refill - check that a page_frag contains enough room
>
[PATCH][net-next] vhost:net: allocate 32KB memory instead of 32K pages when page frag refill
when the page frag refills, 32K pages (128MB of memory) are asked for; this hardly succeeds when the system is under memory stress

And such a large memory size will cause an underflow of the reference bias and make the page refcount chaotic, since the reference bias will be decreased to negative before the allocated memory is used up

so 32KB of memory is the safe choice; meanwhile, remove an unnecessary check

Fixes: e4dab1e6ea64 ("vhost_net: mitigate page reference counting during page frag refill")
Signed-off-by: Zhang Yu
Signed-off-by: Li RongQing
---
 drivers/vhost/net.c | 22 +++---
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index d919284f103b..b933a4a8e4ba 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -641,7 +641,7 @@ static bool tx_can_batch(struct vhost_virtqueue *vq, size_t total_len)
 	       !vhost_vq_avail_empty(vq->dev, vq);
 }
 
-#define SKB_FRAG_PAGE_ORDER	get_order(32768)
+#define SKB_FRAG_PAGE_ORDER	3
 
 static bool vhost_net_page_frag_refill(struct vhost_net *net, unsigned int sz,
 				       struct page_frag *pfrag, gfp_t gfp)
@@ -654,17 +654,17 @@ static bool vhost_net_page_frag_refill(struct vhost_net *net, unsigned int sz,
 	pfrag->offset = 0;
 	net->refcnt_bias = 0;
-	if (SKB_FRAG_PAGE_ORDER) {
-		/* Avoid direct reclaim but allow kswapd to wake */
-		pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
-					  __GFP_COMP | __GFP_NOWARN |
-					  __GFP_NORETRY,
-					  SKB_FRAG_PAGE_ORDER);
-		if (likely(pfrag->page)) {
-			pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER;
-			goto done;
-		}
+
+	/* Avoid direct reclaim but allow kswapd to wake */
+	pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
+				  __GFP_COMP | __GFP_NOWARN |
+				  __GFP_NORETRY,
+				  SKB_FRAG_PAGE_ORDER);
+	if (likely(pfrag->page)) {
+		pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER;
+		goto done;
 	}
+
 	pfrag->page = alloc_page(gfp);
 	if (likely(pfrag->page)) {
 		pfrag->size = PAGE_SIZE;
-- 
2.16.2
[PATCH] net: fix the per task frag allocator size
when filling the task frag, 32K pages (128MB of memory) are asked for; this hardly succeeds when the system is under memory stress

and commit 5640f7685831 ("net: use a per task frag allocator") said it wants 32768 bytes, not 32768 pages:

"(up to 32768 bytes per frag, thats order-3 pages on x86)"

Fixes: 5640f7685831e ("net: use a per task frag allocator")
Signed-off-by: Zhang Yu
Signed-off-by: Li RongQing
---
 net/core/sock.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index 6d7e189e3cd9..e3cbefeedf5c 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2186,8 +2186,8 @@ static void sk_leave_memory_pressure(struct sock *sk)
 	}
 }
 
-/* On 32bit arches, an skb frag is limited to 2^15 */
-#define SKB_FRAG_PAGE_ORDER	get_order(32768)
+/* On 32bit arches, an skb frag is limited to 2^15 bytes */
+#define SKB_FRAG_PAGE_ORDER	get_order(8)
 
 /**
  * skb_page_frag_refill - check that a page_frag contains enough room
-- 
2.16.2
Re: [Patch net-next 2/2] net: dump whole skb data in netdev_rx_csum_fault()
On Wed, Nov 21, 2018 at 11:33 AM Saeed Mahameed wrote:
>
> On Wed, 2018-11-21 at 10:26 -0800, Eric Dumazet wrote:
> > On Wed, Nov 21, 2018 at 10:17 AM Cong Wang wrote:
> > > On Wed, Nov 21, 2018 at 5:05 AM Eric Dumazet <eric.duma...@gmail.com> wrote:
> > > >
> > > > On 11/20/2018 06:13 PM, Cong Wang wrote:
> > > > > Currently, we only dump a few selected skb fields in
> > > > > netdev_rx_csum_fault(). It is not sufficient for debugging checksum
> > > > > fault. This patch introduces skb_dump() which dumps skb mac header,
> > > > > network header and its whole skb->data too.
> > > > >
> > > > > Cc: Herbert Xu
> > > > > Cc: Eric Dumazet
> > > > > Cc: David Miller
> > > > > Signed-off-by: Cong Wang
> > > > > ---
> > > > > +	print_hex_dump(level, "skb data: ", DUMP_PREFIX_OFFSET, 16, 1,
> > > > > +		       skb->data, skb->len, false);
> > > >
> > > > As I mentioned to David, we want all the bytes that were maybe
> > > > already pulled
> > > >
> > > > (skb->head starting point, not skb->data)
> > >
> > > Hmm, with mac header and network header, it is effectively from
> > > skb->head, no?
> > > Is there anything between skb->head and mac header?
> >
> > Oh, I guess we wanted a single hex dump, or we need some user program
> > to be able to
> > rebuild from different memory zones the original CHECKSUM_COMPLETE
> > value.
>
> Normally the driver keeps some headroom @skb->head, so the actual mac
> header starts @ skb->head + driver_specific_headroom

Good to know, but this headroom isn't covered by skb->csum, so it is not useful here, right? The skb->csum for mlx5 only covers the network header and its payload.
Re: [Patch net-next 2/2] net: dump whole skb data in netdev_rx_csum_fault()
On Wed, Nov 21, 2018 at 10:26 AM Eric Dumazet wrote:
>
> On Wed, Nov 21, 2018 at 10:17 AM Cong Wang wrote:
> >
> > On Wed, Nov 21, 2018 at 5:05 AM Eric Dumazet wrote:
> > >
> > > On 11/20/2018 06:13 PM, Cong Wang wrote:
> > > > Currently, we only dump a few selected skb fields in
> > > > netdev_rx_csum_fault(). It is not sufficient for debugging checksum
> > > > fault. This patch introduces skb_dump() which dumps skb mac header,
> > > > network header and its whole skb->data too.
> > > >
> > > > Cc: Herbert Xu
> > > > Cc: Eric Dumazet
> > > > Cc: David Miller
> > > > Signed-off-by: Cong Wang
> > > > ---
> > > >
> > > > +	print_hex_dump(level, "skb data: ", DUMP_PREFIX_OFFSET, 16, 1,
> > > > +		       skb->data, skb->len, false);
> > >
> > > As I mentioned to David, we want all the bytes that were maybe already
> > > pulled
> > >
> > > (skb->head starting point, not skb->data)
> >
> > Hmm, with mac header and network header, it is effectively from skb->head, no?
> > Is there anything between skb->head and mac header?
>
> Oh, I guess we wanted a single hex dump, or we need some user program
> to be able to
> rebuild from different memory zones the original CHECKSUM_COMPLETE value.

Yeah, I can remove the prefix and dump the complete packet as one single block. This means I also need to check where skb->data points to.

> > > Also we will miss the trimmed bytes if there were padding data.
> > > And it seems the various bugs we have are all tied to the pulled or
> > > trimmed bytes.
> >
> > Unless I miss something, the trailing padding data should be in the range
> > [iphdr->tot_len, skb->len]. No?
>
> Not after we did the pskb_trim_rcsum() call, since it has effectively
> reduced skb->len by the number of padding bytes.

Sure, this patch can't change where netdev_rx_csum_fault() gets called. We either need to move the checksum validation earlier, or move the trimming later; neither belongs to this patch.

Thanks.
Re: [PATCH net-next 00/12] switchdev: Convert switchdev_port_obj_{add,del}() to notifiers
Petr Machata writes:
> An offloading driver may need to have access to switchdev events on
> ports that aren't directly under its control. An example is a VXLAN port
> attached to a bridge offloaded by a driver. The driver needs to know
> about VLANs configured on the VXLAN device. However the VXLAN device
> isn't stashed between the bridge and a front-panel-port device (such as
> is the case e.g. for LAG devices), so the usual switchdev ops don't
> reach the driver.

mlxsw will use these notifications to offload VXLAN devices attached to a VLAN-aware bridge. The patches are available here should anyone wish to take a look:

https://github.com/idosch/linux/commits/vxlan

Thanks,
Petr
[PATCH bpf-next 0/3] bpf: add sk_msg helper sk_msg_pop_data
After being able to add metadata to messages with sk_msg_push_data, we have also found it useful to be able to "pop" this metadata off before sending it to applications in some cases. This series adds a new helper, sk_msg_pop_data(), and the associated patches to add tests and tools/lib support.

Thanks!

John Fastabend (3):
  bpf: helper to pop data from messages
  bpf: add msg_pop_data helper to tools
  bpf: test_sockmap, add options for msg_pop_data() helper usage

 include/uapi/linux/bpf.h                        |  13 +-
 net/core/filter.c                               | 169 
 net/ipv4/tcp_bpf.c                              |  14 +-
 tools/include/uapi/linux/bpf.h                  |  13 +-
 tools/testing/selftests/bpf/bpf_helpers.h       |   2 +
 tools/testing/selftests/bpf/test_sockmap.c      | 127 +-
 tools/testing/selftests/bpf/test_sockmap_kern.h |  70 --
 7 files changed, 386 insertions(+), 22 deletions(-)

-- 
2.7.4
[PATCH bpf-next 2/3] bpf: add msg_pop_data helper to tools
Add the necessary header definitions to tools for the new msg_pop_data helper.

Signed-off-by: John Fastabend
---
 tools/include/uapi/linux/bpf.h            | 13 -
 tools/testing/selftests/bpf/bpf_helpers.h |  2 ++
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index c1554aa..95cf7a5 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2268,6 +2268,16 @@ union bpf_attr {
  *
  *	Return
  *		0 on success, or a negative error in case of failure.
+ *
+ * int bpf_msg_pop_data(struct sk_msg_buff *msg, u32 start, u32 pop, u64 flags)
+ *	Description
+ *		Will remove 'pop' bytes from a msg starting at byte 'start'.
+ *		This may result in ENOMEM errors under certain situations where
+ *		an allocation and copy are required due to a full ring buffer.
+ *		However, the helper will try to avoid doing the allocation
+ *		if possible. Other errors can occur if input parameters are
+ *		invalid, either due to the start byte not being a valid part
+ *		of the msg payload and/or the pop value being too large.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -2360,7 +2370,8 @@ union bpf_attr {
 	FN(map_push_elem),		\
 	FN(map_pop_elem),		\
 	FN(map_peek_elem),		\
-	FN(msg_push_data),
+	FN(msg_push_data),		\
+	FN(msg_pop_data),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
index 686e57c..7b69519 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -113,6 +113,8 @@ static int (*bpf_msg_pull_data)(void *ctx, int start, int end, int flags) =
 	(void *) BPF_FUNC_msg_pull_data;
 static int (*bpf_msg_push_data)(void *ctx, int start, int end, int flags) =
 	(void *) BPF_FUNC_msg_push_data;
+static int (*bpf_msg_pop_data)(void *ctx, int start, int cut, int flags) =
+	(void *) BPF_FUNC_msg_pop_data;
 static int (*bpf_bind)(void *ctx, void *addr, int addr_len) =
 	(void *) BPF_FUNC_bind;
 static int (*bpf_xdp_adjust_tail)(void *ctx, int offset) =
-- 
2.7.4
[PATCH bpf-next 1/3] bpf: helper to pop data from messages
This adds a BPF SK_MSG program helper so that we can pop data from a
msg. We use this to pop metadata from a previous push data call.

Signed-off-by: John Fastabend
---
 include/uapi/linux/bpf.h |  13 +++-
 net/core/filter.c        | 169 +++
 net/ipv4/tcp_bpf.c       |  14 +++-
 3 files changed, 192 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index c1554aa..64681f8 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2268,6 +2268,16 @@ union bpf_attr {
  *
  *	Return
  *		0 on success, or a negative error in case of failure.
+ *
+ * int bpf_msg_pop_data(struct sk_msg_buff *msg, u32 start, u32 pop, u64 flags)
+ *	Description
+ *		Will remove 'pop' bytes from a msg starting at byte 'start'.
+ *		This can result in ENOMEM errors under certain situations where
+ *		an allocation and copy are required due to a full ring buffer.
+ *		However, the helper will try to avoid doing the allocation
+ *		if possible. Other errors can occur if input parameters are
+ *		invalid, either due to the start byte not being a valid part
+ *		of the msg payload and/or the pop value being too large.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -2360,7 +2370,8 @@ union bpf_attr {
 	FN(map_push_elem),		\
 	FN(map_pop_elem),		\
 	FN(map_peek_elem),		\
-	FN(msg_push_data),
+	FN(msg_push_data),		\
+	FN(msg_pop_data),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call

diff --git a/net/core/filter.c b/net/core/filter.c
index f6ca38a..c6b35b5 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2428,6 +2428,173 @@ static const struct bpf_func_proto bpf_msg_push_data_proto = {
 	.arg4_type	= ARG_ANYTHING,
 };
 
+static void sk_msg_shift_left(struct sk_msg *msg, int i)
+{
+	int prev;
+
+	do {
+		prev = i;
+		sk_msg_iter_var_next(i);
+		msg->sg.data[prev] = msg->sg.data[i];
+	} while (i != msg->sg.end);
+
+	sk_msg_iter_prev(msg, end);
+}
+
+static void sk_msg_shift_right(struct sk_msg *msg, int i)
+{
+	struct scatterlist tmp, sge;
+
+	sk_msg_iter_next(msg, end);
+	sge = sk_msg_elem_cpy(msg, i);
+	sk_msg_iter_var_next(i);
+	tmp = sk_msg_elem_cpy(msg, i);
+
+	while (i != msg->sg.end) {
+		msg->sg.data[i] = sge;
+		sk_msg_iter_var_next(i);
+		sge = tmp;
+		tmp = sk_msg_elem_cpy(msg, i);
+	}
+}
+
+BPF_CALL_4(bpf_msg_pop_data, struct sk_msg *, msg, u32, start,
+	   u32, len, u64, flags)
+{
+	u32 i = 0, l, space, offset = 0;
+	u64 last = start + len;
+	int pop;
+
+	if (unlikely(flags))
+		return -EINVAL;
+
+	/* First find the starting scatterlist element */
+	i = msg->sg.start;
+	do {
+		l = sk_msg_elem(msg, i)->length;
+
+		if (start < offset + l)
+			break;
+		offset += l;
+		sk_msg_iter_var_next(i);
+	} while (i != msg->sg.end);
+
+	/* Bounds checks: start and pop must be inside message */
+	if (start >= offset + l || last >= msg->sg.size)
+		return -EINVAL;
+
+	space = MAX_MSG_FRAGS - sk_msg_elem_used(msg);
+
+	pop = len;
+	/* --------------| offset
+	 * -| start      |--------- len ---------|
+	 *
+	 *  |----- a ----|-------- pop -------|----- b ----|
+	 *  |______________________________________________| length
+	 *
+	 *
+	 * a:   region at front of scatter element to save
+	 * b:   region at back of scatter element to save when length > A + pop
+	 * pop: region to pop from element, same as input 'pop' here will be
+	 *      decremented below per iteration.
+	 *
+	 * Two top-level cases to handle when start != offset, first B is non
+	 * zero and second B is zero corresponding to when a pop includes more
+	 * than one element.
+	 *
+	 * Then if B is non-zero AND there is no space allocate space and
+	 * compact A, B regions into page. If there is space shift ring to
+	 * the right, freeing the next element in ring to place B, leaving
+	 * A untouched except to reduce length.
+	 */
+	if (start != offset) {
+		struct scatterlist *nsge, *sge = sk_msg_elem(msg, i);
+		int a = start;
+		int b = sge->length - pop - a;
+
+		sk_msg_iter_var_next(i);
+
+		if (pop < sge->length - a) {
+			if (space) {
+				sge->length = a;
+				sk_msg_shift_right(msg, i);
+
[PATCH bpf-next 3/3] bpf: test_sockmap, add options for msg_pop_data()
Similar to msg_pull_data and msg_push_data add a set of options to have msg_pop_data() exercised. Signed-off-by: John Fastabend --- tools/testing/selftests/bpf/test_sockmap.c | 127 +++- tools/testing/selftests/bpf/test_sockmap_kern.h | 70 ++--- 2 files changed, 180 insertions(+), 17 deletions(-) diff --git a/tools/testing/selftests/bpf/test_sockmap.c b/tools/testing/selftests/bpf/test_sockmap.c index 622ade0..e85a771 100644 --- a/tools/testing/selftests/bpf/test_sockmap.c +++ b/tools/testing/selftests/bpf/test_sockmap.c @@ -79,6 +79,8 @@ int txmsg_start; int txmsg_end; int txmsg_start_push; int txmsg_end_push; +int txmsg_start_pop; +int txmsg_pop; int txmsg_ingress; int txmsg_skb; int ktls; @@ -104,6 +106,8 @@ static const struct option long_options[] = { {"txmsg_end", required_argument, NULL, 'e'}, {"txmsg_start_push", required_argument, NULL, 'p'}, {"txmsg_end_push", required_argument, NULL, 'q'}, + {"txmsg_start_pop", required_argument, NULL, 'w'}, + {"txmsg_pop",required_argument, NULL, 'x'}, {"txmsg_ingress", no_argument, _ingress, 1 }, {"txmsg_skb", no_argument, _skb, 1 }, {"ktls", no_argument, , 1 }, @@ -473,13 +477,27 @@ static int msg_loop(int fd, int iov_count, int iov_length, int cnt, clock_gettime(CLOCK_MONOTONIC, >end); } else { int slct, recvp = 0, recv, max_fd = fd; + float total_bytes, txmsg_pop_total; int fd_flags = O_NONBLOCK; struct timeval timeout; - float total_bytes; fd_set w; fcntl(fd, fd_flags); + /* Account for pop bytes noting each iteration of apply will +* call msg_pop_data helper so we need to account for this +* by calculating the number of apply iterations. Note user +* of the tool can create cases where no data is sent by +* manipulating pop/push/pull/etc. For example txmsg_apply 1 +* with txmsg_pop 1 will try to apply 1B at a time but each +* iteration will then pop 1B so no data will ever be sent. +* This is really only useful for testing edge cases in code +* paths. 
+*/ total_bytes = (float)iov_count * (float)iov_length * (float)cnt; + txmsg_pop_total = txmsg_pop; + if (txmsg_apply) + txmsg_pop_total *= (total_bytes / txmsg_apply); + total_bytes -= txmsg_pop_total; err = clock_gettime(CLOCK_MONOTONIC, >start); if (err < 0) perror("recv start time: "); @@ -488,7 +506,7 @@ static int msg_loop(int fd, int iov_count, int iov_length, int cnt, timeout.tv_sec = 0; timeout.tv_usec = 30; } else { - timeout.tv_sec = 1; + timeout.tv_sec = 3; timeout.tv_usec = 0; } @@ -503,7 +521,7 @@ static int msg_loop(int fd, int iov_count, int iov_length, int cnt, goto out_errno; } else if (!slct) { if (opt->verbose) - fprintf(stderr, "unexpected timeout\n"); + fprintf(stderr, "unexpected timeout: recved %zu/%f pop_total %f\n", s->bytes_recvd, total_bytes, txmsg_pop_total); errno = -EIO; clock_gettime(CLOCK_MONOTONIC, >end); goto out_errno; @@ -619,7 +637,7 @@ static int sendmsg_test(struct sockmap_options *opt) iov_count = 1; err = msg_loop(rx_fd, iov_count, iov_buf, cnt, , false, opt); - if (err && opt->verbose) + if (opt->verbose) fprintf(stderr, "msg_loop_rx: iov_count %i iov_buf %i cnt %i err %i\n", iov_count, iov_buf, cnt, err); @@ -931,6 +949,39 @@ static int run_options(struct sockmap_options *options, int cg_fd, int test) } } + if (txmsg_start_pop) { + i = 4; + err = bpf_map_update_elem(map_fd[5], + , _start_pop, BPF_ANY); + if (err) { + fprintf(stderr, + "ERROR: bpf_map_update_elem %i@%i (txmsg_start_pop): %d (%s)\n", + txmsg_start_pop, i, err, strerror(errno)); +
Re: [PATCH v2 bpf-next] bpf: add skb->tstamp r/w access from tc clsact and cg skb progs
On Thu, Nov 22, 2018 at 02:39:16PM -0500, Vlad Dumitrescu wrote: > This could be used to rate limit egress traffic in concert with a qdisc > which supports Earliest Departure Time, such as FQ. > > Write access from cg skb progs only with CAP_SYS_ADMIN, since the value > will be used by downstream qdiscs. It might make sense to relax this. > > Changes v1 -> v2: > - allow access from cg skb, write only with CAP_SYS_ADMIN > > Signed-off-by: Vlad Dumitrescu Applied to bpf-next. I copied Eric's and Willem's Acks from v1, since v2 is essentially the same. Thanks everyone!
[PATCH net-next 12/12] rocker, dsa, ethsw: Don't filter VLAN events on bridge itself
Due to an explicit check in rocker_world_port_obj_vlan_add(), dsa_slave_switchdev_event() resp. port_switchdev_event(), VLAN objects that are added to a device that is not a front-panel port device are ignored. Therefore this check is immaterial. Signed-off-by: Petr Machata Acked-by: Jiri Pirko --- drivers/net/ethernet/rocker/rocker_main.c | 3 --- drivers/staging/fsl-dpaa2/ethsw/ethsw.c | 3 --- net/dsa/port.c| 3 --- 3 files changed, 9 deletions(-) diff --git a/drivers/net/ethernet/rocker/rocker_main.c b/drivers/net/ethernet/rocker/rocker_main.c index f05d5c1341b6..6213827e3956 100644 --- a/drivers/net/ethernet/rocker/rocker_main.c +++ b/drivers/net/ethernet/rocker/rocker_main.c @@ -1632,9 +1632,6 @@ rocker_world_port_obj_vlan_add(struct rocker_port *rocker_port, { struct rocker_world_ops *wops = rocker_port->rocker->wops; - if (netif_is_bridge_master(vlan->obj.orig_dev)) - return -EOPNOTSUPP; - if (!wops->port_obj_vlan_add) return -EOPNOTSUPP; diff --git a/drivers/staging/fsl-dpaa2/ethsw/ethsw.c b/drivers/staging/fsl-dpaa2/ethsw/ethsw.c index 06a233c7cdd3..4fa37d6e598b 100644 --- a/drivers/staging/fsl-dpaa2/ethsw/ethsw.c +++ b/drivers/staging/fsl-dpaa2/ethsw/ethsw.c @@ -719,9 +719,6 @@ static int port_vlans_add(struct net_device *netdev, struct ethsw_port_priv *port_priv = netdev_priv(netdev); int vid, err = 0; - if (netif_is_bridge_master(vlan->obj.orig_dev)) - return -EOPNOTSUPP; - if (switchdev_trans_ph_prepare(trans)) return 0; diff --git a/net/dsa/port.c b/net/dsa/port.c index ed0595459df1..2d7e01b23572 100644 --- a/net/dsa/port.c +++ b/net/dsa/port.c @@ -252,9 +252,6 @@ int dsa_port_vlan_add(struct dsa_port *dp, .vlan = vlan, }; - if (netif_is_bridge_master(vlan->obj.orig_dev)) - return -EOPNOTSUPP; - if (br_vlan_enabled(dp->bridge_dev)) return dsa_port_notify(dp, DSA_NOTIFIER_VLAN_ADD, ); -- 2.4.11
[PATCH net-next 11/12] switchdev: Replace port obj add/del SDO with a notification
Drop switchdev_ops.switchdev_port_obj_add and _del. Drop the uses of this field from all clients, which were migrated to use switchdev notification in the previous patches. Add a new function switchdev_port_obj_notify() that sends the switchdev notifications SWITCHDEV_PORT_OBJ_ADD and _DEL. Update switchdev_port_obj_del_now() to dispatch to this new function. Drop __switchdev_port_obj_add() and update switchdev_port_obj_add() likewise. Signed-off-by: Petr Machata Reviewed-by: Ido Schimmel --- .../ethernet/mellanox/mlxsw/spectrum_switchdev.c | 2 - drivers/net/ethernet/mscc/ocelot.c | 2 - drivers/net/ethernet/rocker/rocker_main.c | 2 - drivers/staging/fsl-dpaa2/ethsw/ethsw.c| 2 - include/net/switchdev.h| 9 --- net/dsa/slave.c| 2 - net/switchdev/switchdev.c | 67 -- 7 files changed, 25 insertions(+), 61 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c index 3756aaecd39c..73e5db176d7e 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c @@ -1968,8 +1968,6 @@ static struct mlxsw_sp_port *mlxsw_sp_lag_rep_port(struct mlxsw_sp *mlxsw_sp, static const struct switchdev_ops mlxsw_sp_port_switchdev_ops = { .switchdev_port_attr_get= mlxsw_sp_port_attr_get, .switchdev_port_attr_set= mlxsw_sp_port_attr_set, - .switchdev_port_obj_add = mlxsw_sp_port_obj_add, - .switchdev_port_obj_del = mlxsw_sp_port_obj_del, }; static int diff --git a/drivers/net/ethernet/mscc/ocelot.c b/drivers/net/ethernet/mscc/ocelot.c index 01403b530522..7f8da8873a96 100644 --- a/drivers/net/ethernet/mscc/ocelot.c +++ b/drivers/net/ethernet/mscc/ocelot.c @@ -1337,8 +1337,6 @@ static int ocelot_port_obj_del(struct net_device *dev, static const struct switchdev_ops ocelot_port_switchdev_ops = { .switchdev_port_attr_get= ocelot_port_attr_get, .switchdev_port_attr_set= ocelot_port_attr_set, - .switchdev_port_obj_add = ocelot_port_obj_add, - 
.switchdev_port_obj_del = ocelot_port_obj_del, }; static int ocelot_port_bridge_join(struct ocelot_port *ocelot_port, diff --git a/drivers/net/ethernet/rocker/rocker_main.c b/drivers/net/ethernet/rocker/rocker_main.c index 806ffe1d906e..f05d5c1341b6 100644 --- a/drivers/net/ethernet/rocker/rocker_main.c +++ b/drivers/net/ethernet/rocker/rocker_main.c @@ -2145,8 +2145,6 @@ static int rocker_port_obj_del(struct net_device *dev, static const struct switchdev_ops rocker_port_switchdev_ops = { .switchdev_port_attr_get= rocker_port_attr_get, .switchdev_port_attr_set= rocker_port_attr_set, - .switchdev_port_obj_add = rocker_port_obj_add, - .switchdev_port_obj_del = rocker_port_obj_del, }; struct rocker_fib_event_work { diff --git a/drivers/staging/fsl-dpaa2/ethsw/ethsw.c b/drivers/staging/fsl-dpaa2/ethsw/ethsw.c index 83e1d92dc7f3..06a233c7cdd3 100644 --- a/drivers/staging/fsl-dpaa2/ethsw/ethsw.c +++ b/drivers/staging/fsl-dpaa2/ethsw/ethsw.c @@ -930,8 +930,6 @@ static int swdev_port_obj_del(struct net_device *netdev, static const struct switchdev_ops ethsw_port_switchdev_ops = { .switchdev_port_attr_get= swdev_port_attr_get, .switchdev_port_attr_set= swdev_port_attr_set, - .switchdev_port_obj_add = swdev_port_obj_add, - .switchdev_port_obj_del = swdev_port_obj_del, }; /* For the moment, only flood setting needs to be updated */ diff --git a/include/net/switchdev.h b/include/net/switchdev.h index 6dc7de576167..866b6d148b77 100644 --- a/include/net/switchdev.h +++ b/include/net/switchdev.h @@ -121,10 +121,6 @@ typedef int switchdev_obj_dump_cb_t(struct switchdev_obj *obj); * @switchdev_port_attr_get: Get a port attribute (see switchdev_attr). * * @switchdev_port_attr_set: Set a port attribute (see switchdev_attr). - * - * @switchdev_port_obj_add: Add an object to port (see switchdev_obj_*). - * - * @switchdev_port_obj_del: Delete an object from port (see switchdev_obj_*). 
*/ struct switchdev_ops { int (*switchdev_port_attr_get)(struct net_device *dev, @@ -132,11 +128,6 @@ struct switchdev_ops { int (*switchdev_port_attr_set)(struct net_device *dev, const struct switchdev_attr *attr, struct switchdev_trans *trans); - int (*switchdev_port_obj_add)(struct net_device *dev, - const struct switchdev_obj *obj, - struct switchdev_trans *trans); - int (*switchdev_port_obj_del)(struct net_device *dev, -
[PATCH net-next 10/12] ocelot: Handle SWITCHDEV_PORT_OBJ_ADD/_DEL
Following patches will change the way of distributing port object changes from a switchdev operation to a switchdev notifier. The switchdev code currently recursively descends through layers of lower devices, eventually calling the op on a front-panel port device. The notifier will instead be sent referencing the bridge port device, which may be a stacking device that's one of front-panel ports uppers, or a completely unrelated device. Dispatch the new events to ocelot_port_obj_add() resp. _del() to maintain the same behavior that the switchdev operation based code currently has. Pass through switchdev_handle_port_obj_add() / _del() to handle the recursive descend, because Ocelot supports LAG uppers. Register to the new switchdev blocking notifier chain to get the new events when they start getting distributed. Signed-off-by: Petr Machata Acked-by: Jiri Pirko --- drivers/net/ethernet/mscc/ocelot.c | 28 drivers/net/ethernet/mscc/ocelot.h | 1 + drivers/net/ethernet/mscc/ocelot_board.c | 3 +++ 3 files changed, 32 insertions(+) diff --git a/drivers/net/ethernet/mscc/ocelot.c b/drivers/net/ethernet/mscc/ocelot.c index 3238b9ee42f3..01403b530522 100644 --- a/drivers/net/ethernet/mscc/ocelot.c +++ b/drivers/net/ethernet/mscc/ocelot.c @@ -1595,6 +1595,34 @@ struct notifier_block ocelot_netdevice_nb __read_mostly = { }; EXPORT_SYMBOL(ocelot_netdevice_nb); +static int ocelot_switchdev_blocking_event(struct notifier_block *unused, + unsigned long event, void *ptr) +{ + struct net_device *dev = switchdev_notifier_info_to_dev(ptr); + int err; + + switch (event) { + /* Blocking events. 
*/ + case SWITCHDEV_PORT_OBJ_ADD: + err = switchdev_handle_port_obj_add(dev, ptr, + ocelot_netdevice_dev_check, + ocelot_port_obj_add); + return notifier_from_errno(err); + case SWITCHDEV_PORT_OBJ_DEL: + err = switchdev_handle_port_obj_del(dev, ptr, + ocelot_netdevice_dev_check, + ocelot_port_obj_del); + return notifier_from_errno(err); + } + + return NOTIFY_DONE; +} + +struct notifier_block ocelot_switchdev_blocking_nb __read_mostly = { + .notifier_call = ocelot_switchdev_blocking_event, +}; +EXPORT_SYMBOL(ocelot_switchdev_blocking_nb); + int ocelot_probe_port(struct ocelot *ocelot, u8 port, void __iomem *regs, struct phy_device *phy) diff --git a/drivers/net/ethernet/mscc/ocelot.h b/drivers/net/ethernet/mscc/ocelot.h index 62c7c8eb00d9..086775f7b52f 100644 --- a/drivers/net/ethernet/mscc/ocelot.h +++ b/drivers/net/ethernet/mscc/ocelot.h @@ -499,5 +499,6 @@ int ocelot_probe_port(struct ocelot *ocelot, u8 port, struct phy_device *phy); extern struct notifier_block ocelot_netdevice_nb; +extern struct notifier_block ocelot_switchdev_blocking_nb; #endif diff --git a/drivers/net/ethernet/mscc/ocelot_board.c b/drivers/net/ethernet/mscc/ocelot_board.c index 4c23d18bbf44..ca3ea2fbfcd0 100644 --- a/drivers/net/ethernet/mscc/ocelot_board.c +++ b/drivers/net/ethernet/mscc/ocelot_board.c @@ -12,6 +12,7 @@ #include #include #include +#include #include "ocelot.h" @@ -328,6 +329,7 @@ static int mscc_ocelot_probe(struct platform_device *pdev) } register_netdevice_notifier(_netdevice_nb); + register_switchdev_blocking_notifier(_switchdev_blocking_nb); dev_info(>dev, "Ocelot switch probed\n"); @@ -342,6 +344,7 @@ static int mscc_ocelot_remove(struct platform_device *pdev) struct ocelot *ocelot = platform_get_drvdata(pdev); ocelot_deinit(ocelot); + unregister_switchdev_blocking_notifier(_switchdev_blocking_nb); unregister_netdevice_notifier(_netdevice_nb); return 0; -- 2.4.11
[PATCH net-next 09/12] mlxsw: spectrum_switchdev: Handle SWITCHDEV_PORT_OBJ_ADD/_DEL
Following patches will change the way of distributing port object changes from a switchdev operation to a switchdev notifier. The switchdev code currently recursively descends through layers of lower devices, eventually calling the op on a front-panel port device. The notifier will instead be sent referencing the bridge port device, which may be a stacking device that's one of front-panel ports uppers, or a completely unrelated device. To handle SWITCHDEV_PORT_OBJ_ADD and _DEL, subscribe to the blocking notifier chain. Dispatch to mlxsw_sp_port_obj_add() resp. _del() to maintain the behavior that the switchdev operation based code currently has. Defer to switchdev_handle_port_obj_add() / _del() to handle the recursive descend, because mlxsw supports a number of upper types. Signed-off-by: Petr Machata Reviewed-by: Ido Schimmel --- .../ethernet/mellanox/mlxsw/spectrum_switchdev.c | 45 +- 1 file changed, 44 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c index b32a5ee57fb9..3756aaecd39c 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c @@ -3118,6 +3118,32 @@ static struct notifier_block mlxsw_sp_switchdev_notifier = { .notifier_call = mlxsw_sp_switchdev_event, }; +static int mlxsw_sp_switchdev_blocking_event(struct notifier_block *unused, +unsigned long event, void *ptr) +{ + struct net_device *dev = switchdev_notifier_info_to_dev(ptr); + int err; + + switch (event) { + case SWITCHDEV_PORT_OBJ_ADD: + err = switchdev_handle_port_obj_add(dev, ptr, + mlxsw_sp_port_dev_check, + mlxsw_sp_port_obj_add); + return notifier_from_errno(err); + case SWITCHDEV_PORT_OBJ_DEL: + err = switchdev_handle_port_obj_del(dev, ptr, + mlxsw_sp_port_dev_check, + mlxsw_sp_port_obj_del); + return notifier_from_errno(err); + } + + return NOTIFY_DONE; +} + +static struct notifier_block 
mlxsw_sp_switchdev_blocking_notifier = { + .notifier_call = mlxsw_sp_switchdev_blocking_event, +}; + u8 mlxsw_sp_bridge_port_stp_state(struct mlxsw_sp_bridge_port *bridge_port) { @@ -3127,6 +3153,7 @@ mlxsw_sp_bridge_port_stp_state(struct mlxsw_sp_bridge_port *bridge_port) static int mlxsw_sp_fdb_init(struct mlxsw_sp *mlxsw_sp) { struct mlxsw_sp_bridge *bridge = mlxsw_sp->bridge; + struct notifier_block *nb; int err; err = mlxsw_sp_ageing_set(mlxsw_sp, MLXSW_SP_DEFAULT_AGEING_TIME); @@ -3141,17 +3168,33 @@ static int mlxsw_sp_fdb_init(struct mlxsw_sp *mlxsw_sp) return err; } + nb = _sp_switchdev_blocking_notifier; + err = register_switchdev_blocking_notifier(nb); + if (err) { + dev_err(mlxsw_sp->bus_info->dev, "Failed to register switchdev blocking notifier\n"); + goto err_register_switchdev_blocking_notifier; + } + INIT_DELAYED_WORK(>fdb_notify.dw, mlxsw_sp_fdb_notify_work); bridge->fdb_notify.interval = MLXSW_SP_DEFAULT_LEARNING_INTERVAL; mlxsw_sp_fdb_notify_work_schedule(mlxsw_sp); return 0; + +err_register_switchdev_blocking_notifier: + unregister_switchdev_notifier(_sp_switchdev_notifier); + return err; } static void mlxsw_sp_fdb_fini(struct mlxsw_sp *mlxsw_sp) { + struct notifier_block *nb; + cancel_delayed_work_sync(_sp->bridge->fdb_notify.dw); - unregister_switchdev_notifier(_sp_switchdev_notifier); + nb = _sp_switchdev_blocking_notifier; + unregister_switchdev_blocking_notifier(nb); + + unregister_switchdev_notifier(_sp_switchdev_notifier); } int mlxsw_sp_switchdev_init(struct mlxsw_sp *mlxsw_sp) -- 2.4.11
[PATCH net-next 08/12] switchdev: Add helpers to aid traversal through lower devices
After the transition from switchdev operations to notifier chain (which
will take place in following patches), the onus is on the driver to
find its own devices below a possible layer of LAG or other uppers.

The logic to do so is fairly repetitive: each driver is looking for its
own devices among the lowers of the notified device. For those that it
finds, it calls a handler. To indicate that the event was handled,
struct switchdev_notifier_port_obj_info.handled is set. The differences
lie only in what constitutes an "own" device and what handler to call.

Therefore abstract this logic into two helpers,
switchdev_handle_port_obj_add() and switchdev_handle_port_obj_del(). If
a driver only supports physical ports under a bridge device, it will
simply avoid this layer of indirection.

One area where this helper diverges from the current switchdev behavior
is the case of mixed lowers, some of which are switchdev ports and some
of which are not. Previously, such a scenario would fail with
-EOPNOTSUPP. The helper could do that for lowers for which the
passed-in predicate doesn't hold. That would however break the case
that switchdev ports from several different drivers are stashed under
one master, a scenario that switchdev currently happily supports.
Therefore tolerate any and all unknown netdevices, whether they are
backed by a switchdev driver or not.
Signed-off-by: Petr Machata Reviewed-by: Ido Schimmel --- include/net/switchdev.h | 33 +++ net/switchdev/switchdev.c | 100 ++ 2 files changed, 133 insertions(+) diff --git a/include/net/switchdev.h b/include/net/switchdev.h index a2f3ebf39301..6dc7de576167 100644 --- a/include/net/switchdev.h +++ b/include/net/switchdev.h @@ -210,6 +210,18 @@ void switchdev_port_fwd_mark_set(struct net_device *dev, bool switchdev_port_same_parent_id(struct net_device *a, struct net_device *b); +int switchdev_handle_port_obj_add(struct net_device *dev, + struct switchdev_notifier_port_obj_info *port_obj_info, + bool (*check_cb)(const struct net_device *dev), + int (*add_cb)(struct net_device *dev, + const struct switchdev_obj *obj, + struct switchdev_trans *trans)); +int switchdev_handle_port_obj_del(struct net_device *dev, + struct switchdev_notifier_port_obj_info *port_obj_info, + bool (*check_cb)(const struct net_device *dev), + int (*del_cb)(struct net_device *dev, + const struct switchdev_obj *obj)); + #define SWITCHDEV_SET_OPS(netdev, ops) ((netdev)->switchdev_ops = (ops)) #else @@ -284,6 +296,27 @@ static inline bool switchdev_port_same_parent_id(struct net_device *a, return false; } +static inline int +switchdev_handle_port_obj_add(struct net_device *dev, + struct switchdev_notifier_port_obj_info *port_obj_info, + bool (*check_cb)(const struct net_device *dev), + int (*add_cb)(struct net_device *dev, + const struct switchdev_obj *obj, + struct switchdev_trans *trans)) +{ + return 0; +} + +static inline int +switchdev_handle_port_obj_del(struct net_device *dev, + struct switchdev_notifier_port_obj_info *port_obj_info, + bool (*check_cb)(const struct net_device *dev), + int (*del_cb)(struct net_device *dev, + const struct switchdev_obj *obj)) +{ + return 0; +} + #define SWITCHDEV_SET_OPS(netdev, ops) do {} while (0) #endif diff --git a/net/switchdev/switchdev.c b/net/switchdev/switchdev.c index e109bb97ce3f..099434ec7996 100644 --- a/net/switchdev/switchdev.c +++ 
b/net/switchdev/switchdev.c @@ -621,3 +621,103 @@ bool switchdev_port_same_parent_id(struct net_device *a, return netdev_phys_item_id_same(_attr.u.ppid, _attr.u.ppid); } EXPORT_SYMBOL_GPL(switchdev_port_same_parent_id); + +static int __switchdev_handle_port_obj_add(struct net_device *dev, + struct switchdev_notifier_port_obj_info *port_obj_info, + bool (*check_cb)(const struct net_device *dev), + int (*add_cb)(struct net_device *dev, + const struct switchdev_obj *obj, + struct switchdev_trans *trans)) +{ + struct net_device *lower_dev; + struct list_head *iter; + int err = -EOPNOTSUPP; + + if (check_cb(dev)) { + /* This flag is only checked if the return value is success. */ + port_obj_info->handled = true; + return add_cb(dev, port_obj_info->obj, port_obj_info->trans); + } + + /* Switch
[PATCH net-next 07/12] staging: fsl-dpaa2: ethsw: Handle SWITCHDEV_PORT_OBJ_ADD/_DEL
Following patches will change the way of distributing port object
changes from a switchdev operation to a switchdev notifier. The
switchdev code currently recursively descends through layers of lower
devices, eventually calling the op on a front-panel port device. The
notifier will instead be sent referencing the bridge port device, which
may be a stacking device that's one of the front-panel ports' uppers,
or a completely unrelated device.

ethsw currently doesn't support any uppers other than bridge.
SWITCHDEV_OBJ_ID_HOST_MDB and _PORT_MDB objects are always notified on
the bridge port device. Thus the only case in which a stacked device
could be validly referenced by port object notifications is bridge
notifications for VLAN objects added to the bridge itself. But the
driver explicitly rejects such notifications in port_vlans_add(). It is
therefore safe to assume that the only interesting case is that the
notification is on a front-panel port netdevice.

To handle SWITCHDEV_PORT_OBJ_ADD and _DEL, subscribe to the blocking
notifier chain. Dispatch to swdev_port_obj_add() resp. _del() to
maintain the behavior that the switchdev operation based code currently
has.
Signed-off-by: Petr Machata Acked-by: Jiri Pirko --- drivers/staging/fsl-dpaa2/ethsw/ethsw.c | 56 + 1 file changed, 56 insertions(+) diff --git a/drivers/staging/fsl-dpaa2/ethsw/ethsw.c b/drivers/staging/fsl-dpaa2/ethsw/ethsw.c index e379b0fa936f..83e1d92dc7f3 100644 --- a/drivers/staging/fsl-dpaa2/ethsw/ethsw.c +++ b/drivers/staging/fsl-dpaa2/ethsw/ethsw.c @@ -1088,10 +1088,51 @@ static int port_switchdev_event(struct notifier_block *unused, return NOTIFY_BAD; } +static int +ethsw_switchdev_port_obj_event(unsigned long event, struct net_device *netdev, + struct switchdev_notifier_port_obj_info *port_obj_info) +{ + int err = -EOPNOTSUPP; + + switch (event) { + case SWITCHDEV_PORT_OBJ_ADD: + err = swdev_port_obj_add(netdev, port_obj_info->obj, +port_obj_info->trans); + break; + case SWITCHDEV_PORT_OBJ_DEL: + err = swdev_port_obj_del(netdev, port_obj_info->obj); + break; + } + + port_obj_info->handled = true; + return notifier_from_errno(err); +} + +static int port_switchdev_blocking_event(struct notifier_block *unused, +unsigned long event, void *ptr) +{ + struct net_device *dev = switchdev_notifier_info_to_dev(ptr); + + if (!ethsw_port_dev_check(dev)) + return NOTIFY_DONE; + + switch (event) { + case SWITCHDEV_PORT_OBJ_ADD: /* fall through */ + case SWITCHDEV_PORT_OBJ_DEL: + return ethsw_switchdev_port_obj_event(event, dev, ptr); + } + + return NOTIFY_DONE; +} + static struct notifier_block port_switchdev_nb = { .notifier_call = port_switchdev_event, }; +static struct notifier_block port_switchdev_blocking_nb = { + .notifier_call = port_switchdev_blocking_event, +}; + static int ethsw_register_notifier(struct device *dev) { int err; @@ -1108,8 +1149,16 @@ static int ethsw_register_notifier(struct device *dev) goto err_switchdev_nb; } + err = register_switchdev_blocking_notifier(_switchdev_blocking_nb); + if (err) { + dev_err(dev, "Failed to register switchdev blocking notifier\n"); + goto err_switchdev_blocking_nb; + } + return 0; +err_switchdev_blocking_nb: + 
unregister_switchdev_notifier(_switchdev_nb); err_switchdev_nb: unregister_netdevice_notifier(_nb); return err; @@ -1296,8 +1345,15 @@ static int ethsw_port_init(struct ethsw_port_priv *port_priv, u16 port) static void ethsw_unregister_notifier(struct device *dev) { + struct notifier_block *nb; int err; + nb = _switchdev_blocking_nb; + err = unregister_switchdev_blocking_notifier(nb); + if (err) + dev_err(dev, + "Failed to unregister switchdev blocking notifier (%d)\n", err); + err = unregister_switchdev_notifier(_switchdev_nb); if (err) dev_err(dev, -- 2.4.11
[PATCH net-next 04/12] rocker: Handle SWITCHDEV_PORT_OBJ_ADD/_DEL
Following patches will change the way of distributing port object
changes from a switchdev operation to a switchdev notifier. The
switchdev code currently recursively descends through layers of lower
devices, eventually calling the op on a front-panel port device. The
notifier will instead be sent referencing the bridge port device, which
may be a stacking device that's one of the front-panel ports' uppers,
or a completely unrelated device.

rocker currently doesn't support any uppers other than bridge. Thus the
only case in which a stacked device could be validly referenced by port
object notifications is bridge notifications for VLAN objects added to
the bridge itself. But the driver explicitly rejects such notifications
in rocker_world_port_obj_vlan_add(). It is therefore safe to assume
that the only interesting case is that the notification is on a
front-panel port netdevice.

Subscribe to the blocking notifier chain. In the handler, filter out
notifications on any foreign netdevices. Dispatch the new notifiers to
rocker_port_obj_add() resp. _del() to maintain the behavior that the
switchdev operation based code currently has.
Signed-off-by: Petr Machata Acked-by: Jiri Pirko --- drivers/net/ethernet/rocker/rocker_main.c | 55 +++ 1 file changed, 55 insertions(+) diff --git a/drivers/net/ethernet/rocker/rocker_main.c b/drivers/net/ethernet/rocker/rocker_main.c index beb06628f22d..806ffe1d906e 100644 --- a/drivers/net/ethernet/rocker/rocker_main.c +++ b/drivers/net/ethernet/rocker/rocker_main.c @@ -2812,12 +2812,54 @@ static int rocker_switchdev_event(struct notifier_block *unused, return NOTIFY_DONE; } +static int +rocker_switchdev_port_obj_event(unsigned long event, struct net_device *netdev, + struct switchdev_notifier_port_obj_info *port_obj_info) +{ + int err = -EOPNOTSUPP; + + switch (event) { + case SWITCHDEV_PORT_OBJ_ADD: + err = rocker_port_obj_add(netdev, port_obj_info->obj, + port_obj_info->trans); + break; + case SWITCHDEV_PORT_OBJ_DEL: + err = rocker_port_obj_del(netdev, port_obj_info->obj); + break; + } + + port_obj_info->handled = true; + return notifier_from_errno(err); +} + +static int rocker_switchdev_blocking_event(struct notifier_block *unused, + unsigned long event, void *ptr) +{ + struct net_device *dev = switchdev_notifier_info_to_dev(ptr); + + if (!rocker_port_dev_check(dev)) + return NOTIFY_DONE; + + switch (event) { + case SWITCHDEV_PORT_OBJ_ADD: + case SWITCHDEV_PORT_OBJ_DEL: + return rocker_switchdev_port_obj_event(event, dev, ptr); + } + + return NOTIFY_DONE; +} + static struct notifier_block rocker_switchdev_notifier = { .notifier_call = rocker_switchdev_event, }; +static struct notifier_block rocker_switchdev_blocking_notifier = { + .notifier_call = rocker_switchdev_blocking_event, +}; + static int rocker_probe(struct pci_dev *pdev, const struct pci_device_id *id) { + struct notifier_block *nb; struct rocker *rocker; int err; @@ -2933,6 +2975,13 @@ static int rocker_probe(struct pci_dev *pdev, const struct pci_device_id *id) goto err_register_switchdev_notifier; } + nb = _switchdev_blocking_notifier; + err = register_switchdev_blocking_notifier(nb); + if (err) 
{ + dev_err(&pdev->dev, "Failed to register switchdev blocking notifier\n"); + goto err_register_switchdev_blocking_notifier; + } + rocker->hw.id = rocker_read64(rocker, SWITCH_ID); dev_info(&pdev->dev, "Rocker switch with id %*phN\n", @@ -2940,6 +2989,8 @@ static int rocker_probe(struct pci_dev *pdev, const struct pci_device_id *id) return 0; +err_register_switchdev_blocking_notifier: + unregister_switchdev_notifier(&rocker_switchdev_notifier); err_register_switchdev_notifier: unregister_fib_notifier(&rocker->fib_nb); err_register_fib_notifier: @@ -2971,6 +3022,10 @@ static int rocker_probe(struct pci_dev *pdev, const struct pci_device_id *id) static void rocker_remove(struct pci_dev *pdev) { struct rocker *rocker = pci_get_drvdata(pdev); + struct notifier_block *nb; + + nb = &rocker_switchdev_blocking_notifier; + unregister_switchdev_blocking_notifier(nb); unregister_switchdev_notifier(&rocker_switchdev_notifier); unregister_fib_notifier(&rocker->fib_nb); -- 2.4.11
[PATCH net-next 06/12] staging: fsl-dpaa2: ethsw: Introduce ethsw_port_dev_check()
ethsw currently uses an open-coded comparison of netdev_ops to determine whether a device represents a front-panel port. Wrap this in a named function to simplify reuse. Signed-off-by: Petr Machata Acked-by: Jiri Pirko --- drivers/staging/fsl-dpaa2/ethsw/ethsw.c | 7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/drivers/staging/fsl-dpaa2/ethsw/ethsw.c b/drivers/staging/fsl-dpaa2/ethsw/ethsw.c index 7a7ca67822c5..e379b0fa936f 100644 --- a/drivers/staging/fsl-dpaa2/ethsw/ethsw.c +++ b/drivers/staging/fsl-dpaa2/ethsw/ethsw.c @@ -972,6 +972,11 @@ static int port_bridge_leave(struct net_device *netdev) return err; } +static bool ethsw_port_dev_check(const struct net_device *netdev) +{ + return netdev->netdev_ops == &ethsw_port_ops; +} + static int port_netdevice_event(struct notifier_block *unused, unsigned long event, void *ptr) { @@ -980,7 +985,7 @@ static int port_netdevice_event(struct notifier_block *unused, struct net_device *upper_dev; int err = 0; - if (netdev->netdev_ops != &ethsw_port_ops) + if (!ethsw_port_dev_check(netdev)) return NOTIFY_DONE; /* Handle just upper dev link/unlink for the moment */ -- 2.4.11
[PATCH net-next 05/12] net: dsa: slave: Handle SWITCHDEV_PORT_OBJ_ADD/_DEL
Following patches will change the way port object changes are distributed, from a switchdev operation to a switchdev notifier. The switchdev code currently recursively descends through layers of lower devices, eventually calling the op on a front-panel port device. The notifier will instead be sent referencing the bridge port device, which may be a stacking device that's one of the front-panel port's uppers, or a completely unrelated device. DSA currently doesn't support any uppers other than the bridge. SWITCHDEV_OBJ_ID_HOST_MDB and _PORT_MDB objects are always notified on the bridge port device. Thus the only case in which a stacked device could be validly referenced by port object notifications is bridge notifications for VLAN objects added to the bridge itself. But the driver explicitly rejects such notifications in dsa_port_vlan_add(). It is therefore safe to assume that the only interesting case is that the notification is on a front-panel port netdevice. Therefore keep the filtering by dsa_slave_dev_check() in place. To handle SWITCHDEV_PORT_OBJ_ADD and _DEL, subscribe to the blocking notifier chain. Dispatch to dsa_slave_port_obj_add() resp. _del() to maintain the behavior that the switchdev operation based code currently has. 
Signed-off-by: Petr Machata Acked-by: Jiri Pirko --- net/dsa/slave.c | 56 1 file changed, 56 insertions(+) diff --git a/net/dsa/slave.c b/net/dsa/slave.c index 7d0c19e7edcf..d00a0b6d4ce0 100644 --- a/net/dsa/slave.c +++ b/net/dsa/slave.c @@ -1557,6 +1557,44 @@ static int dsa_slave_switchdev_event(struct notifier_block *unused, return NOTIFY_BAD; } +static int +dsa_slave_switchdev_port_obj_event(unsigned long event, + struct net_device *netdev, + struct switchdev_notifier_port_obj_info *port_obj_info) +{ + int err = -EOPNOTSUPP; + + switch (event) { + case SWITCHDEV_PORT_OBJ_ADD: + err = dsa_slave_port_obj_add(netdev, port_obj_info->obj, +port_obj_info->trans); + break; + case SWITCHDEV_PORT_OBJ_DEL: + err = dsa_slave_port_obj_del(netdev, port_obj_info->obj); + break; + } + + port_obj_info->handled = true; + return notifier_from_errno(err); +} + +static int dsa_slave_switchdev_blocking_event(struct notifier_block *unused, + unsigned long event, void *ptr) +{ + struct net_device *dev = switchdev_notifier_info_to_dev(ptr); + + if (!dsa_slave_dev_check(dev)) + return NOTIFY_DONE; + + switch (event) { + case SWITCHDEV_PORT_OBJ_ADD: /* fall through */ + case SWITCHDEV_PORT_OBJ_DEL: + return dsa_slave_switchdev_port_obj_event(event, dev, ptr); + } + + return NOTIFY_DONE; +} + static struct notifier_block dsa_slave_nb __read_mostly = { .notifier_call = dsa_slave_netdevice_event, }; @@ -1565,8 +1603,13 @@ static struct notifier_block dsa_slave_switchdev_notifier = { .notifier_call = dsa_slave_switchdev_event, }; +static struct notifier_block dsa_slave_switchdev_blocking_notifier = { + .notifier_call = dsa_slave_switchdev_blocking_event, +}; + int dsa_slave_register_notifier(void) { + struct notifier_block *nb; int err; err = register_netdevice_notifier(&dsa_slave_nb); @@ -1577,8 +1620,15 @@ int dsa_slave_register_notifier(void) if (err) goto err_switchdev_nb; + nb = &dsa_slave_switchdev_blocking_notifier; + err = register_switchdev_blocking_notifier(nb); + if (err) + goto 
err_switchdev_blocking_nb; + return 0; +err_switchdev_blocking_nb: + unregister_switchdev_notifier(&dsa_slave_switchdev_notifier); err_switchdev_nb: unregister_netdevice_notifier(&dsa_slave_nb); return err; @@ -1586,8 +1636,14 @@ int dsa_slave_register_notifier(void) void dsa_slave_unregister_notifier(void) { + struct notifier_block *nb; int err; + nb = &dsa_slave_switchdev_blocking_notifier; + err = unregister_switchdev_blocking_notifier(nb); + if (err) + pr_err("DSA: failed to unregister switchdev blocking notifier (%d)\n", err); + err = unregister_switchdev_notifier(&dsa_slave_switchdev_notifier); if (err) pr_err("DSA: failed to unregister switchdev notifier (%d)\n", err); -- 2.4.11
[PATCH net-next 02/12] switchdev: Add a blocking notifier chain
In general one can't assume that a switchdev notifier is called in a non-atomic context, and correspondingly, the switchdev notifier chain is an atomic one. However, port object addition and deletion messages are delivered from a process context. Even the MDB addition messages, whose delivery is scheduled from atomic context, are queued and the delivery itself takes place in blocking context. For VLAN messages in particular, keeping the blocking nature is important for error reporting. Therefore introduce a blocking notifier chain and related service functions to distribute the notifications for which a blocking context can be assumed. Signed-off-by: Petr Machata Reviewed-by: Jiri Pirko Reviewed-by: Ido Schimmel --- include/net/switchdev.h | 27 +++ net/switchdev/switchdev.c | 26 ++ 2 files changed, 53 insertions(+) diff --git a/include/net/switchdev.h b/include/net/switchdev.h index dd969224a9b9..e021b67b9b32 100644 --- a/include/net/switchdev.h +++ b/include/net/switchdev.h @@ -182,10 +182,17 @@ int switchdev_port_obj_add(struct net_device *dev, const struct switchdev_obj *obj); int switchdev_port_obj_del(struct net_device *dev, const struct switchdev_obj *obj); + int register_switchdev_notifier(struct notifier_block *nb); int unregister_switchdev_notifier(struct notifier_block *nb); int call_switchdev_notifiers(unsigned long val, struct net_device *dev, struct switchdev_notifier_info *info); + +int register_switchdev_blocking_notifier(struct notifier_block *nb); +int unregister_switchdev_blocking_notifier(struct notifier_block *nb); +int call_switchdev_blocking_notifiers(unsigned long val, struct net_device *dev, + struct switchdev_notifier_info *info); + void switchdev_port_fwd_mark_set(struct net_device *dev, struct net_device *group_dev, bool joining); @@ -241,6 +248,26 @@ static inline int call_switchdev_notifiers(unsigned long val, return NOTIFY_DONE; } +static inline int +register_switchdev_blocking_notifier(struct notifier_block *nb) +{ + return 0; +} + 
+static inline int +unregister_switchdev_blocking_notifier(struct notifier_block *nb) +{ + return 0; +} + +static inline int +call_switchdev_blocking_notifiers(unsigned long val, + struct net_device *dev, + struct switchdev_notifier_info *info) +{ + return NOTIFY_DONE; +} + static inline bool switchdev_port_same_parent_id(struct net_device *a, struct net_device *b) { diff --git a/net/switchdev/switchdev.c b/net/switchdev/switchdev.c index 74b9d916a58b..e109bb97ce3f 100644 --- a/net/switchdev/switchdev.c +++ b/net/switchdev/switchdev.c @@ -535,6 +535,7 @@ int switchdev_port_obj_del(struct net_device *dev, EXPORT_SYMBOL_GPL(switchdev_port_obj_del); static ATOMIC_NOTIFIER_HEAD(switchdev_notif_chain); +static BLOCKING_NOTIFIER_HEAD(switchdev_blocking_notif_chain); /** * register_switchdev_notifier - Register notifier @@ -576,6 +577,31 @@ int call_switchdev_notifiers(unsigned long val, struct net_device *dev, } EXPORT_SYMBOL_GPL(call_switchdev_notifiers); +int register_switchdev_blocking_notifier(struct notifier_block *nb) +{ + struct blocking_notifier_head *chain = &switchdev_blocking_notif_chain; + + return blocking_notifier_chain_register(chain, nb); +} +EXPORT_SYMBOL_GPL(register_switchdev_blocking_notifier); + +int unregister_switchdev_blocking_notifier(struct notifier_block *nb) +{ + struct blocking_notifier_head *chain = &switchdev_blocking_notif_chain; + + return blocking_notifier_chain_unregister(chain, nb); +} +EXPORT_SYMBOL_GPL(unregister_switchdev_blocking_notifier); + +int call_switchdev_blocking_notifiers(unsigned long val, struct net_device *dev, + struct switchdev_notifier_info *info) +{ + info->dev = dev; + return blocking_notifier_call_chain(&switchdev_blocking_notif_chain, + val, info); +} +EXPORT_SYMBOL_GPL(call_switchdev_blocking_notifiers); + bool switchdev_port_same_parent_id(struct net_device *a, struct net_device *b) { -- 2.4.11
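For readers unfamiliar with the kernel's notifier chains, the pattern this patch builds on can be sketched in plain userspace C. This is an illustrative model only, not the kernel's implementation: the real blocking chain in kernel/notifier.c additionally holds an rwsem so that callbacks may sleep, and the constants come from include/linux/notifier.h.

```c
#include <assert.h>
#include <stddef.h>

/* Return codes modeled on include/linux/notifier.h. */
#define NOTIFY_DONE      0x0000          /* don't care */
#define NOTIFY_OK        0x0001          /* suits me */
#define NOTIFY_STOP_MASK 0x8000          /* stop further callbacks */
#define NOTIFY_BAD       (NOTIFY_STOP_MASK | 0x0002)

struct notifier_block {
    int (*notifier_call)(struct notifier_block *nb,
                         unsigned long event, void *data);
    struct notifier_block *next;
};

struct notifier_head {
    struct notifier_block *head;
};

static void chain_register(struct notifier_head *nh, struct notifier_block *nb)
{
    nb->next = nh->head;
    nh->head = nb;
}

/* Deliver an event to every subscriber; a result with NOTIFY_STOP_MASK
 * set (e.g. NOTIFY_BAD) short-circuits the rest of the chain. */
static int chain_call(struct notifier_head *nh, unsigned long event, void *data)
{
    int ret = NOTIFY_DONE;
    struct notifier_block *nb;

    for (nb = nh->head; nb; nb = nb->next) {
        ret = nb->notifier_call(nb, event, data);
        if (ret & NOTIFY_STOP_MASK)
            break;
    }
    return ret;
}

static int demo_hits;

/* Example subscriber (hypothetical): vetoes event 1, accepts the rest. */
static int demo_cb(struct notifier_block *nb, unsigned long event, void *data)
{
    (void)nb; (void)data;
    demo_hits++;
    return event == 1UL ? NOTIFY_BAD : NOTIFY_OK;
}
```

register_switchdev_blocking_notifier() and call_switchdev_blocking_notifiers() in the patch above are thin wrappers around exactly this register/call pair, with the chain head declared via BLOCKING_NOTIFIER_HEAD().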
[PATCH net-next 01/12] switchdev: SWITCHDEV_OBJ_PORT_{VLAN, MDB}(): Sanitize
The two macros SWITCHDEV_OBJ_PORT_VLAN() and SWITCHDEV_OBJ_PORT_MDB() expand to a container_of() call, yielding an appropriate container of their sole argument. However, due to a name collision, the first argument, i.e. the contained object pointer, is not the only one to get expanded. The third argument, which is a structure member name, and should be kept literal, gets expanded as well. The only safe way to use these two macros is therefore to name the local variable passed to them "obj". To fix this, rename the sole argument of the two macros from "obj" (which collides with the member name) to "OBJ". Additionally, instead of passing "OBJ" to container_of() verbatim, parenthesize it, so that a comma in the passed-in expression doesn't pollute the container_of() invocation. Signed-off-by: Petr Machata Acked-by: Jiri Pirko Reviewed-by: Ido Schimmel --- include/net/switchdev.h | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/include/net/switchdev.h b/include/net/switchdev.h index 7b371e7c4bc6..dd969224a9b9 100644 --- a/include/net/switchdev.h +++ b/include/net/switchdev.h @@ -95,8 +95,8 @@ struct switchdev_obj_port_vlan { u16 vid_end; }; -#define SWITCHDEV_OBJ_PORT_VLAN(obj) \ - container_of(obj, struct switchdev_obj_port_vlan, obj) +#define SWITCHDEV_OBJ_PORT_VLAN(OBJ) \ + container_of((OBJ), struct switchdev_obj_port_vlan, obj) /* SWITCHDEV_OBJ_ID_PORT_MDB */ struct switchdev_obj_port_mdb { @@ -105,8 +105,8 @@ struct switchdev_obj_port_mdb { u16 vid; }; -#define SWITCHDEV_OBJ_PORT_MDB(obj) \ - container_of(obj, struct switchdev_obj_port_mdb, obj) +#define SWITCHDEV_OBJ_PORT_MDB(OBJ) \ + container_of((OBJ), struct switchdev_obj_port_mdb, obj) void switchdev_trans_item_enqueue(struct switchdev_trans *trans, void *data, void (*destructor)(void const *), -- 2.4.11
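The hazard being fixed is ordinary macro hygiene, and it can be reproduced outside the kernel. In the sketch below (simplified, illustrative definitions, not the kernel headers), the pre-patch macro would expand its third container_of() argument too, so any caller whose pointer variable isn't literally named "obj" would fail to compile.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified container_of(), same idea as the kernel's. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct switchdev_obj { int id; };

struct switchdev_obj_port_vlan {
    struct switchdev_obj obj;     /* the member is literally named "obj" */
    unsigned short vid_begin;
    unsigned short vid_end;
};

/*
 * Pre-patch (broken) form -- the parameter shadows the member name:
 *
 *   #define SWITCHDEV_OBJ_PORT_VLAN(obj) \
 *       container_of(obj, struct switchdev_obj_port_vlan, obj)
 *
 * SWITCHDEV_OBJ_PORT_VLAN(p) would expand to
 * container_of(p, struct switchdev_obj_port_vlan, p), and offsetof()
 * on the nonexistent member "p" fails to compile.
 */

/* Post-patch form: upper-case parameter, parenthesized expansion. */
#define SWITCHDEV_OBJ_PORT_VLAN(OBJ) \
    container_of((OBJ), struct switchdev_obj_port_vlan, obj)
```

With the fixed macro, a pointer of any name round-trips back to its container, which is all the patch changes.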
[PATCH net-next 03/12] switchdev: Add SWITCHDEV_PORT_OBJ_ADD, SWITCHDEV_PORT_OBJ_DEL
An offloading driver may need to have access to switchdev events on ports that aren't directly under its control. An example is a VXLAN port attached to a bridge offloaded by a driver. The driver needs to know about VLANs configured on the VXLAN device. However the VXLAN device isn't stashed between the bridge and a front-panel-port device (such as is the case e.g. for LAG devices), so the usual switchdev ops don't reach the driver. VXLAN is likely not the only device type like this: in theory any L2 tunnel device that needs offloading will prompt requirement of this sort. This falsifies the assumption that only the lower devices of a front panel port need to be notified to achieve flawless offloading. A way to fix this is to give up the notion of port object addition / deletion as a switchdev operation, which assumes somewhat tight coupling between the message producer and consumer. And instead send the message over a notifier chain. To that end, introduce two new switchdev notifier types, SWITCHDEV_PORT_OBJ_ADD and SWITCHDEV_PORT_OBJ_DEL. These notifier types communicate the same event as the corresponding switchdev op, except in a form of a notification. struct switchdev_notifier_port_obj_info was added to carry the fields that the switchdev op carries. An additional field, handled, will be used to communicate back to switchdev that the event has reached an interested party, which will be important for the two-phase commit. The two switchdev operations themselves are kept in place. Following patches first convert individual clients to the notifier protocol, and only then are the operations removed. 
Signed-off-by: Petr Machata Acked-by: Jiri Pirko Reviewed-by: Ido Schimmel --- include/net/switchdev.h | 10 ++ 1 file changed, 10 insertions(+) diff --git a/include/net/switchdev.h b/include/net/switchdev.h index e021b67b9b32..a2f3ebf39301 100644 --- a/include/net/switchdev.h +++ b/include/net/switchdev.h @@ -146,6 +146,9 @@ enum switchdev_notifier_type { SWITCHDEV_FDB_DEL_TO_DEVICE, SWITCHDEV_FDB_OFFLOADED, + SWITCHDEV_PORT_OBJ_ADD, /* Blocking. */ + SWITCHDEV_PORT_OBJ_DEL, /* Blocking. */ + SWITCHDEV_VXLAN_FDB_ADD_TO_BRIDGE, SWITCHDEV_VXLAN_FDB_DEL_TO_BRIDGE, SWITCHDEV_VXLAN_FDB_ADD_TO_DEVICE, @@ -165,6 +168,13 @@ struct switchdev_notifier_fdb_info { offloaded:1; }; +struct switchdev_notifier_port_obj_info { + struct switchdev_notifier_info info; /* must be first */ + const struct switchdev_obj *obj; + struct switchdev_trans *trans; + bool handled; +}; + static inline struct net_device * switchdev_notifier_info_to_dev(const struct switchdev_notifier_info *info) { -- 2.4.11
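The /* must be first */ comment on the embedded info member is load-bearing: switchdev_notifier_info_to_dev() receives the notifier's opaque void *ptr and reads it as a plain struct switchdev_notifier_info *, which is only valid because the larger port-obj struct starts with it at offset zero. A userspace sketch of the idiom (simplified types, not the kernel's):

```c
#include <assert.h>
#include <stddef.h>

struct net_device { const char *name; };

struct switchdev_notifier_info {
    struct net_device *dev;
};

struct switchdev_notifier_port_obj_info {
    struct switchdev_notifier_info info;   /* must be first */
    const void *obj;
    int handled;
};

/* What a notifier callback does with its opaque pointer: because
 * "info" sits at offset 0, a pointer to the container is also a valid
 * pointer to the embedded info. */
static struct net_device *
switchdev_notifier_info_to_dev(const void *ptr)
{
    const struct switchdev_notifier_info *info = ptr;
    return info->dev;
}
```

The handled flag then travels back to the sender through the same container, which is how switchdev can later tell whether any driver consumed the object.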
[PATCH net-next 00/12] switchdev: Convert switchdev_port_obj_{add,del}() to notifiers
An offloading driver may need to have access to switchdev events on ports that aren't directly under its control. An example is a VXLAN port attached to a bridge offloaded by a driver. The driver needs to know about VLANs configured on the VXLAN device. However the VXLAN device isn't stashed between the bridge and a front-panel-port device (such as is the case e.g. for LAG devices), so the usual switchdev ops don't reach the driver. VXLAN is likely not the only device type like this: in theory any L2 tunnel device that needs offloading will prompt requirement of this sort. A way to fix this is to give up the notion of port object addition / deletion as a switchdev operation, which assumes somewhat tight coupling between the message producer and consumer. And instead send the message over a notifier chain. The series starts with a clean-up patch #1, where SWITCHDEV_OBJ_PORT_{VLAN, MDB}() are fixed up to lift the constraint that the passed-in argument be a simple variable named "obj". switchdev_port_obj_add and _del are invoked in a context that permits blocking. Not only that, at least for the VLAN notification, being able to signal failure is actually important. Therefore introduce a new blocking notifier chain that the new events will be sent on. That's done in patch #2. Retain the current (atomic) notifier chain for the preexisting notifications. In patch #3, introduce two new switchdev notifier types, SWITCHDEV_PORT_OBJ_ADD and SWITCHDEV_PORT_OBJ_DEL. These notifier types communicate the same event as the corresponding switchdev op, except in a form of a notification. struct switchdev_notifier_port_obj_info was added to carry the fields that correspond to the switchdev op arguments. An additional field, handled, will be used to communicate back to switchdev that the event has reached an interested party, which will be important for the two-phase commit. In patches #4, #5, and #7, rocker, DSA resp. 
ethsw are updated to subscribe to the switchdev blocking notifier chain, and handle the new notifier types. #6 introduces a helper to determine whether a netdevice corresponds to a front panel port. What these three drivers have in common is that their ports don't support any uppers besides bridge. That makes it possible to ignore any notifiers that don't reference a front-panel port device, because they are certainly out of scope. Unlike the previous three, mlxsw and ocelot drivers admit stacked devices as uppers. While the current switchdev code recursively descends through layers of lower devices, eventually calling the op on a front-panel port device, the notifier would reference a stacking device that's one of front-panel ports uppers. The filtering is thus more complex. For ocelot, such iteration is currently pretty much required, because there's no bookkeeping of LAG devices. mlxsw does keep the list of LAGs, however it iterates the lower devices anyway when deciding whether an event on a tunnel device pertains to the driver or not. Therefore this patch set instead introduces, in patch #8, a helper to iterate through lowers, much like the current switchdev code does, looking for devices that match a given predicate. Then in patches #9 and #10, first mlxsw and then ocelot are updated to dispatch the newly-added notifier types to the preexisting port_obj_add/_del handlers. The dispatch is done via the new helper, to recursively descend through lower devices. Finally in patch #11, the actual switch is made, retiring the current SDO-based code in favor of a notifier. Now that the event is distributed through a notifier, the explicit netdevice check in rocker, DSA and ethsw doesn't let through any events except those done on a front-panel port itself. It is therefore unnecessary to check in VLAN-handling code whether a VLAN was added to the bridge itself: such events will simply be ignored much sooner. Therefore remove it in patch #12. 
Petr Machata (12): switchdev: SWITCHDEV_OBJ_PORT_{VLAN, MDB}(): Sanitize switchdev: Add a blocking notifier chain switchdev: Add SWITCHDEV_PORT_OBJ_ADD, SWITCHDEV_PORT_OBJ_DEL rocker: Handle SWITCHDEV_PORT_OBJ_ADD/_DEL net: dsa: slave: Handle SWITCHDEV_PORT_OBJ_ADD/_DEL staging: fsl-dpaa2: ethsw: Introduce ethsw_port_dev_check() staging: fsl-dpaa2: ethsw: Handle SWITCHDEV_PORT_OBJ_ADD/_DEL switchdev: Add helpers to aid traversal through lower devices mlxsw: spectrum_switchdev: Handle SWITCHDEV_PORT_OBJ_ADD/_DEL ocelot: Handle SWITCHDEV_PORT_OBJ_ADD/_DEL switchdev: Replace port obj add/del SDO with a notification rocker, dsa, ethsw: Don't filter VLAN events on bridge itself .../ethernet/mellanox/mlxsw/spectrum_switchdev.c | 47 - drivers/net/ethernet/mscc/ocelot.c | 30 +++- drivers/net/ethernet/mscc/ocelot.h | 1 + drivers/net/ethernet/mscc/ocelot_board.c | 3 + drivers/net/ethernet/rocker/rocker_main.c | 60 ++- drivers/staging/fsl-dpaa2/ethsw/ethsw.c| 68
Re: [EXT] Re: [PATCH net-next 4/4] octeontx2-af: Bringup CGX LMAC links by default
On Thu, Nov 22, 2018 at 07:26:56PM +0100, Andrew Lunn wrote: > External Email > > -- > On Thu, Nov 22, 2018 at 05:18:37PM +0530, Linu Cherian wrote: > > From: Linu Cherian > > > > - Added new CGX firmware interface API for sending link up/down > > commands > > > > - Do link up for cgx lmac ports by default at the time of CGX > > driver probe. > > Hi Linu > > This is a complex driver which i don't understand... > > By link up, do you mean the equivalent of 'ip link set up dev ethX'? Not really. It is used to do the necessary LMAC port hardware configuration based on the connected PHYs and bring up the PHY links. > >Andrew -- Linu Cherian
[PATCH net-next v3 4/5] netns: enable to specify a nsid for a get request
Combined with NETNSA_TARGET_NSID, this enables "translating" a nsid from one netns to the corresponding nsid in another netns. This is useful when using NETLINK_F_LISTEN_ALL_NSID, because it helps the user interpret a nsid received from another netns. Signed-off-by: Nicolas Dichtel Reviewed-by: David Ahern --- net/core/net_namespace.c | 5 + 1 file changed, 5 insertions(+) diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c index 885c54197e31..dd25fb22ad45 100644 --- a/net/core/net_namespace.c +++ b/net/core/net_namespace.c @@ -797,6 +797,11 @@ static int rtnl_net_getid(struct sk_buff *skb, struct nlmsghdr *nlh, } else if (tb[NETNSA_FD]) { peer = get_net_ns_by_fd(nla_get_u32(tb[NETNSA_FD])); nla = tb[NETNSA_FD]; + } else if (tb[NETNSA_NSID]) { + peer = get_net_ns_by_id(net, nla_get_u32(tb[NETNSA_NSID])); + if (!peer) + peer = ERR_PTR(-ENOENT); + nla = tb[NETNSA_NSID]; } else { NL_SET_ERR_MSG(extack, "Peer netns reference is missing"); return -EINVAL; } -- 2.18.0
[PATCH net-next v3 1/5] netns: remove net arg from rtnl_net_fill()
This argument is not used anymore. Fixes: cab3c8ec8d57 ("netns: always provide the id to rtnl_net_fill()") Signed-off-by: Nicolas Dichtel Reviewed-by: David Ahern --- net/core/net_namespace.c | 10 -- 1 file changed, 4 insertions(+), 6 deletions(-) diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c index fefe72774aeb..52b9620e3457 100644 --- a/net/core/net_namespace.c +++ b/net/core/net_namespace.c @@ -739,7 +739,7 @@ static int rtnl_net_get_size(void) } static int rtnl_net_fill(struct sk_buff *skb, u32 portid, u32 seq, int flags, -int cmd, struct net *net, int nsid) +int cmd, int nsid) { struct nlmsghdr *nlh; struct rtgenmsg *rth; @@ -801,7 +801,7 @@ static int rtnl_net_getid(struct sk_buff *skb, struct nlmsghdr *nlh, id = peernet2id(net, peer); err = rtnl_net_fill(msg, NETLINK_CB(skb).portid, nlh->nlmsg_seq, 0, - RTM_NEWNSID, net, id); + RTM_NEWNSID, id); if (err < 0) goto err_out; @@ -816,7 +816,6 @@ static int rtnl_net_getid(struct sk_buff *skb, struct nlmsghdr *nlh, } struct rtnl_net_dump_cb { - struct net *net; struct sk_buff *skb; struct netlink_callback *cb; int idx; @@ -833,7 +832,7 @@ static int rtnl_net_dumpid_one(int id, void *peer, void *data) ret = rtnl_net_fill(net_cb->skb, NETLINK_CB(net_cb->cb->skb).portid, net_cb->cb->nlh->nlmsg_seq, NLM_F_MULTI, - RTM_NEWNSID, net_cb->net, id); + RTM_NEWNSID, id); if (ret < 0) return ret; @@ -846,7 +845,6 @@ static int rtnl_net_dumpid(struct sk_buff *skb, struct netlink_callback *cb) { struct net *net = sock_net(skb->sk); struct rtnl_net_dump_cb net_cb = { - .net = net, .skb = skb, .cb = cb, .idx = 0, @@ -876,7 +874,7 @@ static void rtnl_net_notifyid(struct net *net, int cmd, int id) if (!msg) goto out; - err = rtnl_net_fill(msg, 0, 0, 0, cmd, net, id); + err = rtnl_net_fill(msg, 0, 0, 0, cmd, id); if (err < 0) goto err_out; -- 2.18.0
[PATCH net-next v3 2/5] netns: introduce 'struct net_fill_args'
This is preparatory work. To avoid passing too many arguments to the function rtnl_net_fill(), a new structure is defined. Signed-off-by: Nicolas Dichtel Reviewed-by: David Ahern --- net/core/net_namespace.c | 48 1 file changed, 34 insertions(+), 14 deletions(-) diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c index 52b9620e3457..f8a5966b086c 100644 --- a/net/core/net_namespace.c +++ b/net/core/net_namespace.c @@ -738,20 +738,28 @@ static int rtnl_net_get_size(void) ; } -static int rtnl_net_fill(struct sk_buff *skb, u32 portid, u32 seq, int flags, -int cmd, int nsid) +struct net_fill_args { + u32 portid; + u32 seq; + int flags; + int cmd; + int nsid; +}; + +static int rtnl_net_fill(struct sk_buff *skb, struct net_fill_args *args) { struct nlmsghdr *nlh; struct rtgenmsg *rth; - nlh = nlmsg_put(skb, portid, seq, cmd, sizeof(*rth), flags); + nlh = nlmsg_put(skb, args->portid, args->seq, args->cmd, sizeof(*rth), + args->flags); if (!nlh) return -EMSGSIZE; rth = nlmsg_data(nlh); rth->rtgen_family = AF_UNSPEC; - if (nla_put_s32(skb, NETNSA_NSID, nsid)) + if (nla_put_s32(skb, NETNSA_NSID, args->nsid)) goto nla_put_failure; nlmsg_end(skb, nlh); @@ -767,10 +775,15 @@ static int rtnl_net_getid(struct sk_buff *skb, struct nlmsghdr *nlh, { struct net *net = sock_net(skb->sk); struct nlattr *tb[NETNSA_MAX + 1]; + struct net_fill_args fillargs = { + .portid = NETLINK_CB(skb).portid, + .seq = nlh->nlmsg_seq, + .cmd = RTM_NEWNSID, + }; struct nlattr *nla; struct sk_buff *msg; struct net *peer; - int err, id; + int err; err = nlmsg_parse(nlh, sizeof(struct rtgenmsg), tb, NETNSA_MAX, rtnl_net_policy, extack); @@ -799,9 +812,8 @@ static int rtnl_net_getid(struct sk_buff *skb, struct nlmsghdr *nlh, goto out; } - id = peernet2id(net, peer); - err = rtnl_net_fill(msg, NETLINK_CB(skb).portid, nlh->nlmsg_seq, 0, - RTM_NEWNSID, id); + fillargs.nsid = peernet2id(net, peer); + err = rtnl_net_fill(msg, &fillargs); if (err < 0) goto err_out; @@ -817,7 +829,7 @@ static int 
rtnl_net_getid(struct sk_buff *skb, struct nlmsghdr *nlh, struct rtnl_net_dump_cb { struct sk_buff *skb; - struct netlink_callback *cb; + struct net_fill_args fillargs; int idx; int s_idx; }; @@ -830,9 +842,8 @@ static int rtnl_net_dumpid_one(int id, void *peer, void *data) if (net_cb->idx < net_cb->s_idx) goto cont; - ret = rtnl_net_fill(net_cb->skb, NETLINK_CB(net_cb->cb->skb).portid, - net_cb->cb->nlh->nlmsg_seq, NLM_F_MULTI, - RTM_NEWNSID, id); + net_cb->fillargs.nsid = id; + ret = rtnl_net_fill(net_cb->skb, &net_cb->fillargs); if (ret < 0) return ret; @@ -846,7 +857,12 @@ static int rtnl_net_dumpid(struct sk_buff *skb, struct netlink_callback *cb) struct net *net = sock_net(skb->sk); struct rtnl_net_dump_cb net_cb = { .skb = skb, - .cb = cb, + .fillargs = { + .portid = NETLINK_CB(cb->skb).portid, + .seq = cb->nlh->nlmsg_seq, + .flags = NLM_F_MULTI, + .cmd = RTM_NEWNSID, + }, .idx = 0, .s_idx = cb->args[0], }; @@ -867,6 +883,10 @@ static int rtnl_net_dumpid(struct sk_buff *skb, struct netlink_callback *cb) static void rtnl_net_notifyid(struct net *net, int cmd, int id) { + struct net_fill_args fillargs = { + .cmd = cmd, + .nsid = id, + }; struct sk_buff *msg; int err = -ENOMEM; @@ -874,7 +894,7 @@ static void rtnl_net_notifyid(struct net *net, int cmd, int id) if (!msg) goto out; - err = rtnl_net_fill(msg, 0, 0, 0, cmd, id); + err = rtnl_net_fill(msg, &fillargs); if (err < 0) goto err_out; -- 2.18.0
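The refactor above is a common C pattern: once a function's scalar arguments are folded into one struct, that struct can be embedded into a callback cookie (here struct rtnl_net_dump_cb), with the constant parts initialized once and only the per-entry field updated inside the iteration. A minimal userspace sketch of the shape — hypothetical names and a printf-based stand-in for rtnl_net_fill(), not the kernel code:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Before: fill(skb, portid, seq, flags, cmd, nsid) -- six scalars.
 * After: everything rides in one args struct. */
struct net_fill_args {
    unsigned int portid;
    unsigned int seq;
    int flags;
    int cmd;
    int nsid;
};

static int fill_one(char *buf, size_t len, const struct net_fill_args *args)
{
    /* Stand-in for rtnl_net_fill(): render one message. */
    return snprintf(buf, len, "cmd=%d nsid=%d seq=%u",
                    args->cmd, args->nsid, args->seq);
}

/* Dump-callback cookie embedding the fill arguments, mirroring
 * struct rtnl_net_dump_cb in the patch. */
struct dump_cb {
    struct net_fill_args fillargs;  /* constant parts set once */
    int idx;
};

static int dump_one(struct dump_cb *cb, int nsid, char *buf, size_t len)
{
    cb->fillargs.nsid = nsid;       /* only the per-entry field changes */
    cb->idx++;
    return fill_one(buf, len, &cb->fillargs);
}
```

The payoff, visible in the follow-up patches, is that adding a new argument (add_ref, ref_nsid) touches only the struct and the fill function, not every call site.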
[PATCH net-next v3 5/5] netns: enable to dump full nsid translation table
Like the previous patch, the goal is to ease to convert nsids from one netns to another netns. A new attribute (NETNSA_CURRENT_NSID) is added to the kernel answer when NETNSA_TARGET_NSID is provided, thus the user can easily convert nsids. Signed-off-by: Nicolas Dichtel --- include/uapi/linux/net_namespace.h | 1 + net/core/net_namespace.c | 31 -- 2 files changed, 26 insertions(+), 6 deletions(-) diff --git a/include/uapi/linux/net_namespace.h b/include/uapi/linux/net_namespace.h index 0ed9dd61d32a..9f9956809565 100644 --- a/include/uapi/linux/net_namespace.h +++ b/include/uapi/linux/net_namespace.h @@ -17,6 +17,7 @@ enum { NETNSA_PID, NETNSA_FD, NETNSA_TARGET_NSID, + NETNSA_CURRENT_NSID, __NETNSA_MAX, }; diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c index dd25fb22ad45..2f25d7f2a43b 100644 --- a/net/core/net_namespace.c +++ b/net/core/net_namespace.c @@ -736,6 +736,7 @@ static int rtnl_net_get_size(void) { return NLMSG_ALIGN(sizeof(struct rtgenmsg)) + nla_total_size(sizeof(s32)) /* NETNSA_NSID */ + + nla_total_size(sizeof(s32)) /* NETNSA_CURRENT_NSID */ ; } @@ -745,6 +746,8 @@ struct net_fill_args { int flags; int cmd; int nsid; + bool add_ref; + int ref_nsid; }; static int rtnl_net_fill(struct sk_buff *skb, struct net_fill_args *args) @@ -763,6 +766,10 @@ static int rtnl_net_fill(struct sk_buff *skb, struct net_fill_args *args) if (nla_put_s32(skb, NETNSA_NSID, args->nsid)) goto nla_put_failure; + if (args->add_ref && + nla_put_s32(skb, NETNSA_CURRENT_NSID, args->ref_nsid)) + goto nla_put_failure; + nlmsg_end(skb, nlh); return 0; @@ -782,7 +789,6 @@ static int rtnl_net_getid(struct sk_buff *skb, struct nlmsghdr *nlh, .cmd = RTM_NEWNSID, }; struct net *peer, *target = net; - bool put_target = false; struct nlattr *nla; struct sk_buff *msg; int err; @@ -824,7 +830,8 @@ static int rtnl_net_getid(struct sk_buff *skb, struct nlmsghdr *nlh, err = PTR_ERR(target); goto out; } - put_target = true; + fillargs.add_ref = true; + fillargs.ref_nsid = 
peernet2id(net, peer); } msg = nlmsg_new(rtnl_net_get_size(), GFP_KERNEL); @@ -844,7 +851,7 @@ static int rtnl_net_getid(struct sk_buff *skb, struct nlmsghdr *nlh, err_out: nlmsg_free(msg); out: - if (put_target) + if (fillargs.add_ref) put_net(target); put_net(peer); return err; @@ -852,11 +859,11 @@ static int rtnl_net_getid(struct sk_buff *skb, struct nlmsghdr *nlh, struct rtnl_net_dump_cb { struct net *tgt_net; + struct net *ref_net; struct sk_buff *skb; struct net_fill_args fillargs; int idx; int s_idx; - bool put_tgt_net; }; static int rtnl_net_dumpid_one(int id, void *peer, void *data) @@ -868,6 +875,8 @@ static int rtnl_net_dumpid_one(int id, void *peer, void *data) goto cont; net_cb->fillargs.nsid = id; + if (net_cb->fillargs.add_ref) + net_cb->fillargs.ref_nsid = __peernet2id(net_cb->ref_net, peer); ret = rtnl_net_fill(net_cb->skb, &net_cb->fillargs); if (ret < 0) return ret; @@ -904,8 +913,9 @@ static int rtnl_valid_dump_net_req(const struct nlmsghdr *nlh, struct sock *sk, "Invalid target network namespace id"); return PTR_ERR(net); } + net_cb->fillargs.add_ref = true; + net_cb->ref_net = net_cb->tgt_net; net_cb->tgt_net = net; - net_cb->put_tgt_net = true; } else { NL_SET_BAD_ATTR(extack, tb[i]); NL_SET_ERR_MSG(extack, @@ -940,12 +950,21 @@ static int rtnl_net_dumpid(struct sk_buff *skb, struct netlink_callback *cb) } spin_lock_bh(&net_cb.tgt_net->nsid_lock); + if (net_cb.fillargs.add_ref && + !net_eq(net_cb.ref_net, net_cb.tgt_net) && + !spin_trylock_bh(&net_cb.ref_net->nsid_lock)) { + err = -EAGAIN; + goto end; + } idr_for_each(&net_cb.tgt_net->netns_ids, rtnl_net_dumpid_one, &net_cb); + if (net_cb.fillargs.add_ref && + !net_eq(net_cb.ref_net, net_cb.tgt_net)) + spin_unlock_bh(&net_cb.ref_net->nsid_lock); spin_unlock_bh(&net_cb.tgt_net->nsid_lock); cb->args[0] = net_cb.idx; end: - if (net_cb.put_tgt_net) + if (net_cb.fillargs.add_ref) put_net(net_cb.tgt_net); return err < 0 ? err : skb->len; } -- 2.18.0
[PATCH net-next v3 0/5] Ease to interpret net-nsid
The goal of this series is to ease the interpretation of nsid received in netlink messages from other netns (when the user uses NETLINK_F_LISTEN_ALL_NSID). After this series, with a patched iproute2: $ ip netns add foo $ ip netns add bar $ touch /var/run/netns/init_net $ mount --bind /proc/1/ns/net /var/run/netns/init_net $ ip netns set init_net 11 $ ip netns set foo 12 $ ip netns set bar 13 $ ip netns init_net (id: 11) bar (id: 13) foo (id: 12) $ ip -n foo netns set init_net 21 $ ip -n foo netns set foo 22 $ ip -n foo netns set bar 23 $ ip -n foo netns init_net (id: 21) bar (id: 23) foo (id: 22) $ ip -n bar netns set init_net 31 $ ip -n bar netns set foo 32 $ ip -n bar netns set bar 33 $ ip -n bar netns init_net (id: 31) bar (id: 33) foo (id: 32) $ ip netns list-id target-nsid 12 nsid 21 current-nsid 11 (iproute2 netns name: init_net) nsid 22 current-nsid 12 (iproute2 netns name: foo) nsid 23 current-nsid 13 (iproute2 netns name: bar) $ ip -n bar netns list-id target-nsid 32 nsid 31 nsid 21 current-nsid 31 (iproute2 netns name: init_net) v2 -> v3: - patch 5/5: account NETNSA_CURRENT_NSID in rtnl_net_get_size() v1 -> v2: - patch 1/5: remove net from struct rtnl_net_dump_cb - patch 2/5: new in this version - patch 3/5: use a bool to know if rtnl_get_net_ns_capable() was called - patch 5/5: use struct net_fill_args include/uapi/linux/net_namespace.h | 2 + net/core/net_namespace.c | 158 +++-- 2 files changed, 134 insertions(+), 26 deletions(-) Comments are welcomed, Regards, Nicolas
[PATCH net-next v3 3/5] netns: add support of NETNSA_TARGET_NSID
Like it was done for link and address, add the ability to perform get/dump in another netns by specifying a target nsid attribute. Signed-off-by: Nicolas Dichtel Reviewed-by: David Ahern --- include/uapi/linux/net_namespace.h | 1 + net/core/net_namespace.c | 86 ++ 2 files changed, 76 insertions(+), 11 deletions(-) diff --git a/include/uapi/linux/net_namespace.h b/include/uapi/linux/net_namespace.h index 0187c74d8889..0ed9dd61d32a 100644 --- a/include/uapi/linux/net_namespace.h +++ b/include/uapi/linux/net_namespace.h @@ -16,6 +16,7 @@ enum { NETNSA_NSID, NETNSA_PID, NETNSA_FD, + NETNSA_TARGET_NSID, __NETNSA_MAX, }; diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c index f8a5966b086c..885c54197e31 100644 --- a/net/core/net_namespace.c +++ b/net/core/net_namespace.c @@ -669,6 +669,7 @@ static const struct nla_policy rtnl_net_policy[NETNSA_MAX + 1] = { [NETNSA_NSID] = { .type = NLA_S32 }, [NETNSA_PID]= { .type = NLA_U32 }, [NETNSA_FD] = { .type = NLA_U32 }, + [NETNSA_TARGET_NSID]= { .type = NLA_S32 }, }; static int rtnl_net_newid(struct sk_buff *skb, struct nlmsghdr *nlh, @@ -780,9 +781,10 @@ static int rtnl_net_getid(struct sk_buff *skb, struct nlmsghdr *nlh, .seq = nlh->nlmsg_seq, .cmd = RTM_NEWNSID, }; + struct net *peer, *target = net; + bool put_target = false; struct nlattr *nla; struct sk_buff *msg; - struct net *peer; int err; err = nlmsg_parse(nlh, sizeof(struct rtgenmsg), tb, NETNSA_MAX, @@ -806,13 +808,27 @@ static int rtnl_net_getid(struct sk_buff *skb, struct nlmsghdr *nlh, return PTR_ERR(peer); } + if (tb[NETNSA_TARGET_NSID]) { + int id = nla_get_s32(tb[NETNSA_TARGET_NSID]); + + target = rtnl_get_net_ns_capable(NETLINK_CB(skb).sk, id); + if (IS_ERR(target)) { + NL_SET_BAD_ATTR(extack, tb[NETNSA_TARGET_NSID]); + NL_SET_ERR_MSG(extack, + "Target netns reference is invalid"); + err = PTR_ERR(target); + goto out; + } + put_target = true; + } + msg = nlmsg_new(rtnl_net_get_size(), GFP_KERNEL); if (!msg) { err = -ENOMEM; goto out; } - 
fillargs.nsid = peernet2id(net, peer); + fillargs.nsid = peernet2id(target, peer); err = rtnl_net_fill(msg, &fillargs); if (err < 0) goto err_out; @@ -823,15 +839,19 @@ static int rtnl_net_getid(struct sk_buff *skb, struct nlmsghdr *nlh, err_out: nlmsg_free(msg); out: + if (put_target) + put_net(target); put_net(peer); return err; } struct rtnl_net_dump_cb { + struct net *tgt_net; struct sk_buff *skb; struct net_fill_args fillargs; int idx; int s_idx; + bool put_tgt_net; }; static int rtnl_net_dumpid_one(int id, void *peer, void *data) @@ -852,10 +872,50 @@ static int rtnl_net_dumpid_one(int id, void *peer, void *data) return 0; } +static int rtnl_valid_dump_net_req(const struct nlmsghdr *nlh, struct sock *sk, + struct rtnl_net_dump_cb *net_cb, + struct netlink_callback *cb) +{ + struct netlink_ext_ack *extack = cb->extack; + struct nlattr *tb[NETNSA_MAX + 1]; + int err, i; + + err = nlmsg_parse_strict(nlh, sizeof(struct rtgenmsg), tb, NETNSA_MAX, +rtnl_net_policy, extack); + if (err < 0) + return err; + + for (i = 0; i <= NETNSA_MAX; i++) { + if (!tb[i]) + continue; + + if (i == NETNSA_TARGET_NSID) { + struct net *net; + + net = rtnl_get_net_ns_capable(sk, nla_get_s32(tb[i])); + if (IS_ERR(net)) { + NL_SET_BAD_ATTR(extack, tb[i]); + NL_SET_ERR_MSG(extack, + "Invalid target network namespace id"); + return PTR_ERR(net); + } + net_cb->tgt_net = net; + net_cb->put_tgt_net = true; + } else { + NL_SET_BAD_ATTR(extack, tb[i]); + NL_SET_ERR_MSG(extack, + "Unsupported attribute in dump request"); + return -EINVAL; + } + } + + return 0; +} + static int rtnl_net_dumpid(struct sk_buff *skb, struct netlink_callback *cb) { - struct net *net = sock_net(skb->sk); struct
[PATCH v2] samples: bpf: fix: error handling regarding kprobe_events
Currently, a kprobe_events failure is not handled properly. Because the write to kprobe_events happens indirectly through system(), it cannot be determined whether an error came from kprobe or from system() itself. // buf = "echo '%c:%s %s' >> /s/k/d/t/kprobe_events" err = system(buf); if (err < 0) { printf("failed to create kprobe .."); return -1; } For example, when running the ./tracex7 sample on an ext4 partition, "echo p:open_ctree open_ctree >> /s/k/d/t/kprobe_events" makes system() fail with status code 256. => The error comes from kprobe, but it is not handled correctly. According to the system(3) man page, the return value simply passes along the termination status of the child shell rather than reporting the error as -1 (success aside). In other words, the code snippet above does not work as intended. ex) running ./tracex7 in an ext4 environment: # Current Output sh: echo: I/O error failed to open event open_ctree # Desired Output failed to create kprobe 'open_ctree' error 'No such file or directory' The problem is that an error cannot be attributed to either the child process or system() itself. Writing to 'kprobe_events' with write() directly, on the other hand, makes the command failure verifiable, and every error is reported as -1. So I suggest writing to 'kprobe_events' with write() directly rather than calling system(). Signed-off-by: Daniel T. Lee --- Changes in v2: - Fix code style at variable declaration.
samples/bpf/bpf_load.c | 33 - 1 file changed, 24 insertions(+), 9 deletions(-) diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c index e6d7e0fe155b..96783207de4a 100644 --- a/samples/bpf/bpf_load.c +++ b/samples/bpf/bpf_load.c @@ -54,6 +54,23 @@ static int populate_prog_array(const char *event, int prog_fd) return 0; } +static int write_kprobe_events(const char *val) +{ + int fd, ret, flags; + + if ((val != NULL) && (val[0] == '\0')) + flags = O_WRONLY | O_TRUNC; + else + flags = O_WRONLY | O_APPEND; + + fd = open("/sys/kernel/debug/tracing/kprobe_events", flags); + + ret = write(fd, val, strlen(val)); + close(fd); + + return ret; +} + static int load_and_attach(const char *event, struct bpf_insn *prog, int size) { bool is_socket = strncmp(event, "socket", 6) == 0; @@ -165,10 +182,9 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size) #ifdef __x86_64__ if (strncmp(event, "sys_", 4) == 0) { - snprintf(buf, sizeof(buf), -"echo '%c:__x64_%s __x64_%s' >> /sys/kernel/debug/tracing/kprobe_events", -is_kprobe ? 'p' : 'r', event, event); - err = system(buf); + snprintf(buf, sizeof(buf), "%c:__x64_%s __x64_%s", + is_kprobe ? 'p' : 'r', event, event); + err = write_kprobe_events(buf); if (err >= 0) { need_normal_check = false; event_prefix = "__x64_"; @@ -176,10 +192,9 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size) } #endif if (need_normal_check) { - snprintf(buf, sizeof(buf), -"echo '%c:%s %s' >> /sys/kernel/debug/tracing/kprobe_events", -is_kprobe ? 'p' : 'r', event, event); - err = system(buf); + snprintf(buf, sizeof(buf), "%c:%s %s", + is_kprobe ? 
'p' : 'r', event, event); + err = write_kprobe_events(buf); if (err < 0) { printf("failed to create kprobe '%s' error '%s'\n", event, strerror(errno)); @@ -519,7 +534,7 @@ static int do_load_bpf_file(const char *path, fixup_map_cb fixup_map) return 1; /* clear all kprobes */ - i = system("echo \"\" > /sys/kernel/debug/tracing/kprobe_events"); + i = write_kprobe_events(""); /* scan over all elf sections to get license and map info */ for (i = 1; i < ehdr.e_shnum; i++) { -- 2.17.1
Re: [RFC PATCH bpf-next] libbpf: make bpf_object__open default to UNSPEC
[ +Wang ] On 11/22/2018 07:03 AM, Nikita V. Shirokov wrote: > Currently, by default, libbpf's bpf_object__open requires > a BPF program to specify a version in its code because of two things: > 1) the default prog type is set to KPROBE > 2) KPROBE requires (in kernel/bpf/syscall.c) a version to be specified > > In this RFC I'm proposing to change the default to UNSPEC and also to change > the logic of libbpf so that it reflects what we have today in the kernel > (i.e. only the KPROBE type requires the version to be explicitly set). > > Reason for the change: > currently only libbpf requires the version to be explicitly set > by default. It would be really hard for maintainers of other custom > bpf loaders to migrate to libbpf (as they don't control users' code, > and migration to the new loader (libbpf) won't be transparent for the end > user). > > What is going to break after this change: > if someone was relying on the default being KPROBE for bpf_object__open, > their code will stop working. However, I really doubt that anyone > is using this for kprobe-type programs (instead of, say, bcc or > other tracing frameworks). > > Other possible solutions (for discussion; would require more machinery): > add another function like bpf_object__open w/ default to unspec > > Signed-off-by: Nikita V. Shirokov > --- > tools/lib/bpf/libbpf.c | 8 > 1 file changed, 4 insertions(+), 4 deletions(-) > > diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c > index 0f14f7c074c2..ed4212a4c5f9 100644 > --- a/tools/lib/bpf/libbpf.c > +++ b/tools/lib/bpf/libbpf.c > @@ -333,7 +333,7 @@ bpf_program__init(void *data, size_t size, char > *section_name, int idx, > prog->idx = idx; > prog->instances.fds = NULL; > prog->instances.nr = -1; > - prog->type = BPF_PROG_TYPE_KPROBE; > + prog->type = BPF_PROG_TYPE_UNSPEC; > prog->btf_fd = -1; Seems this was mostly for historic reasons, but for a generic library this would indeed be an odd convention for default.
Wang, given 5f44e4c810bf ("tools lib bpf: New API to adjust type of a BPF program"), are you in any way relying on this default or using things like bpf_program__set_kprobe() instead which you've added there? If latter, I'd say we should then change it better now than later when there's even more lib usage (and in particular before we add official ABI versioning). > return 0; > @@ -1649,12 +1649,12 @@ static bool bpf_prog_type__needs_kver(enum > bpf_prog_type type) > case BPF_PROG_TYPE_LIRC_MODE2: > case BPF_PROG_TYPE_SK_REUSEPORT: > case BPF_PROG_TYPE_FLOW_DISSECTOR: > - return false; > case BPF_PROG_TYPE_UNSPEC: > - case BPF_PROG_TYPE_KPROBE: > case BPF_PROG_TYPE_TRACEPOINT: > - case BPF_PROG_TYPE_PERF_EVENT: > case BPF_PROG_TYPE_RAW_TRACEPOINT: > + case BPF_PROG_TYPE_PERF_EVENT: > + return false; > + case BPF_PROG_TYPE_KPROBE: > default: > return true; > } > Thanks, Daniel
Re: [PATCH] bpf: fix check of allowed specifiers in bpf_trace_printk
Hi Martynas, On 11/22/2018 05:00 PM, Martynas Pumputis wrote: > A format string consisting of "%p" or "%s" followed by an invalid > specifier (e.g. "%p%\n" or "%s%") could pass the check which > would make format_decode (lib/vsprintf.c) to warn. > > Reported-by: syzbot+1ec5c5ec949c4adaa...@syzkaller.appspotmail.com > Signed-off-by: Martynas Pumputis > --- > kernel/trace/bpf_trace.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c > index 08fcfe440c63..9ab05736e1a1 100644 > --- a/kernel/trace/bpf_trace.c > +++ b/kernel/trace/bpf_trace.c > @@ -225,6 +225,8 @@ BPF_CALL_5(bpf_trace_printk, char *, fmt, u32, fmt_size, > u64, arg1, > (void *) (long) unsafe_addr, > sizeof(buf)); > } > + if (fmt[i] == '%') > + i--; > continue; > } Thanks for the fix! Could we simplify the logic a bit to avoid having to navigate i back and forth which got us in trouble in the first place? Like below (untested) perhaps? diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index 08fcfe4..ff83b8c 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -196,11 +196,13 @@ BPF_CALL_5(bpf_trace_printk, char *, fmt, u32, fmt_size, u64, arg1, i++; } else if (fmt[i] == 'p' || fmt[i] == 's') { mod[fmt_cnt]++; - i++; - if (!isspace(fmt[i]) && !ispunct(fmt[i]) && fmt[i] != 0) + /* Disallow any further format extensions. */ + if (fmt[i + 1] != 0 && + !isspace(fmt[i + 1]) && + !ispunct(fmt[i + 1])) return -EINVAL; fmt_cnt++; - if (fmt[i - 1] == 's') { + if (fmt[i] == 's') { if (str_seen) /* allow only one '%s' per fmt string */ return -EINVAL; Thanks, Daniel
Re: [PATCH net-next,v3 00/12] add flow_rule infrastructure
On Thu, Nov 22, 2018 at 02:22:20PM -0200, Marcelo Ricardo Leitner wrote: > On Wed, Nov 21, 2018 at 03:51:20AM +0100, Pablo Neira Ayuso wrote: > > Hi, > > > > This patchset is the third iteration [1] [2] [3] to introduce a kernel > > intermediate representation (IR) to express ACL hardware offloads. > > On the v2 cover letter you had: > > """ > However, cost of this layer is very small, adding 1 million rules via > tc -batch, perf shows: > > 0.06% tc [kernel.vmlinux][k] tc_setup_flow_action > """ > > The above doesn't include time spent on children calls and I'm worried > about the new allocation done by flow_rule_alloc(), as it can impact > rule insertion rate. I'll run some tests here and report back. I'm seeing +60ms on 1.75s (~3.4%) to add 40k flower rules on ingress with skip_hw and tc in batch mode, with flows like: filter add dev p6p2 parent : protocol ip prio 1 flower skip_hw src_mac ec:13:db:00:00:00 dst_mac ec:14:c2:00:00:00 src_ip 56.0.0.0 dst_ip 55.0.0.0 action drop Only 20ms out of those 60ms were consumed within fl_change() calls (considering children calls), though. Do you see something similar? I used current net-next (d59da3fbfe3f), both without and with this patchset applied.
[PATCH net-next 3/5] r8169: simplify detecting chip versions with same XID
For the GMII chip versions we set the version number which was set already. This can be simplified. Signed-off-by: Heiner Kallweit --- drivers/net/ethernet/realtek/r8169.c | 19 +++ 1 file changed, 7 insertions(+), 12 deletions(-) diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c index 1e549b26b..9a696455e 100644 --- a/drivers/net/ethernet/realtek/r8169.c +++ b/drivers/net/ethernet/realtek/r8169.c @@ -2116,18 +2116,13 @@ static void rtl8169_get_mac_version(struct rtl8169_private *tp) if (tp->mac_version == RTL_GIGA_MAC_NONE) { dev_err(tp_to_dev(tp), "unknown chip XID %03x\n", reg & 0xfcf); - } else if (tp->mac_version == RTL_GIGA_MAC_VER_42) { - tp->mac_version = tp->supports_gmii ? - RTL_GIGA_MAC_VER_42 : - RTL_GIGA_MAC_VER_43; - } else if (tp->mac_version == RTL_GIGA_MAC_VER_45) { - tp->mac_version = tp->supports_gmii ? - RTL_GIGA_MAC_VER_45 : - RTL_GIGA_MAC_VER_47; - } else if (tp->mac_version == RTL_GIGA_MAC_VER_46) { - tp->mac_version = tp->supports_gmii ? - RTL_GIGA_MAC_VER_46 : - RTL_GIGA_MAC_VER_48; + } else if (!tp->supports_gmii) { + if (tp->mac_version == RTL_GIGA_MAC_VER_42) + tp->mac_version = RTL_GIGA_MAC_VER_43; + else if (tp->mac_version == RTL_GIGA_MAC_VER_45) + tp->mac_version = RTL_GIGA_MAC_VER_47; + else if (tp->mac_version == RTL_GIGA_MAC_VER_46) + tp->mac_version = RTL_GIGA_MAC_VER_48; } } -- 2.19.1
[PATCH net-next 4/5] r8169: use napi_consume_skb where possible
Use napi_consume_skb() where possible to profit from bulk free infrastructure. Signed-off-by: Heiner Kallweit --- drivers/net/ethernet/realtek/r8169.c | 7 --- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c index 9a696455e..2ca0b2ed9 100644 --- a/drivers/net/ethernet/realtek/r8169.c +++ b/drivers/net/ethernet/realtek/r8169.c @@ -6204,7 +6204,8 @@ static void rtl8169_pcierr_interrupt(struct net_device *dev) rtl_schedule_task(tp, RTL_FLAG_TASK_RESET_PENDING); } -static void rtl_tx(struct net_device *dev, struct rtl8169_private *tp) +static void rtl_tx(struct net_device *dev, struct rtl8169_private *tp, + int budget) { unsigned int dirty_tx, tx_left, bytes_compl = 0, pkts_compl = 0; @@ -6232,7 +6233,7 @@ static void rtl_tx(struct net_device *dev, struct rtl8169_private *tp) if (status & LastFrag) { pkts_compl++; bytes_compl += tx_skb->skb->len; - dev_consume_skb_any(tx_skb->skb); + napi_consume_skb(tx_skb->skb, budget); tx_skb->skb = NULL; } dirty_tx++; @@ -6475,7 +6476,7 @@ static int rtl8169_poll(struct napi_struct *napi, int budget) work_done = rtl_rx(dev, tp, (u32) budget); - rtl_tx(dev, tp); + rtl_tx(dev, tp, budget); if (work_done < budget) { napi_complete_done(napi, work_done); -- 2.19.1
[PATCH net-next 5/5] r8169: replace macro TX_FRAGS_READY_FOR with a function
Replace macro TX_FRAGS_READY_FOR with function rtl_tx_slots_avail to make code cleaner and type-safe. Signed-off-by: Heiner Kallweit --- drivers/net/ethernet/realtek/r8169.c | 24 +--- 1 file changed, 13 insertions(+), 11 deletions(-) diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c index 2ca0b2ed9..f768b966e 100644 --- a/drivers/net/ethernet/realtek/r8169.c +++ b/drivers/net/ethernet/realtek/r8169.c @@ -56,13 +56,6 @@ #define R8169_MSG_DEFAULT \ (NETIF_MSG_DRV | NETIF_MSG_PROBE | NETIF_MSG_IFUP | NETIF_MSG_IFDOWN) -#define TX_SLOTS_AVAIL(tp) \ - (tp->dirty_tx + NUM_TX_DESC - tp->cur_tx) - -/* A skbuff with nr_frags needs nr_frags+1 entries in the tx queue */ -#define TX_FRAGS_READY_FOR(tp,nr_frags) \ - (TX_SLOTS_AVAIL(tp) >= (nr_frags + 1)) - /* Maximum number of multicast addresses to filter (vs. Rx-all-multicast). The RTL chips use a 64 element hash table based on the Ethernet CRC. */ static const int multicast_filter_limit = 32; @@ -6058,6 +6051,15 @@ static bool rtl8169_tso_csum_v2(struct rtl8169_private *tp, return true; } +static bool rtl_tx_slots_avail(struct rtl8169_private *tp, + unsigned int nr_frags) +{ + unsigned int slots_avail = tp->dirty_tx + NUM_TX_DESC - tp->cur_tx; + + /* A skbuff with nr_frags needs nr_frags+1 entries in the tx queue */ + return slots_avail > nr_frags; +} + static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb, struct net_device *dev) { @@ -6069,7 +6071,7 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb, u32 opts[2], len; int frags; - if (unlikely(!TX_FRAGS_READY_FOR(tp, skb_shinfo(skb)->nr_frags))) { + if (unlikely(!rtl_tx_slots_avail(tp, skb_shinfo(skb)->nr_frags))) { netif_err(tp, drv, dev, "BUG! 
Tx Ring full when queue awake!\n"); goto err_stop_0; } @@ -6126,7 +6128,7 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb, mmiowb(); - if (!TX_FRAGS_READY_FOR(tp, MAX_SKB_FRAGS)) { + if (!rtl_tx_slots_avail(tp, MAX_SKB_FRAGS)) { /* Avoid wrongly optimistic queue wake-up: rtl_tx thread must * not miss a ring update when it notices a stopped queue. */ @@ -6140,7 +6142,7 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb, * can't. */ smp_mb(); - if (TX_FRAGS_READY_FOR(tp, MAX_SKB_FRAGS)) + if (rtl_tx_slots_avail(tp, MAX_SKB_FRAGS)) netif_wake_queue(dev); } @@ -6258,7 +6260,7 @@ static void rtl_tx(struct net_device *dev, struct rtl8169_private *tp, */ smp_mb(); if (netif_queue_stopped(dev) && - TX_FRAGS_READY_FOR(tp, MAX_SKB_FRAGS)) { + rtl_tx_slots_avail(tp, MAX_SKB_FRAGS)) { netif_wake_queue(dev); } /* -- 2.19.1
[PATCH net-next 2/5] r8169: remove default chip versions
Even the chip versions within a family have so many differences that using a default chip version doesn't really make sense. Instead of being left with flaky network connectivity at best, bail out and report the unknown chip version. Signed-off-by: Heiner Kallweit --- drivers/net/ethernet/realtek/r8169.c | 15 +-- 1 file changed, 5 insertions(+), 10 deletions(-) diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c index bef89ba50..1e549b26b 100644 --- a/drivers/net/ethernet/realtek/r8169.c +++ b/drivers/net/ethernet/realtek/r8169.c @@ -2011,8 +2011,7 @@ static const struct ethtool_ops rtl8169_ethtool_ops = { .set_link_ksettings = phy_ethtool_set_link_ksettings, }; -static void rtl8169_get_mac_version(struct rtl8169_private *tp, - u8 default_version) +static void rtl8169_get_mac_version(struct rtl8169_private *tp) { /* * The driver currently handles the 8168Bf and the 8168Be identically @@ -2116,9 +2115,7 @@ static void rtl8169_get_mac_version(struct rtl8169_private *tp, tp->mac_version = p->mac_version; if (tp->mac_version == RTL_GIGA_MAC_NONE) { - dev_notice(tp_to_dev(tp), - "unknown MAC, using family default\n"); - tp->mac_version = default_version; + dev_err(tp_to_dev(tp), "unknown chip XID %03x\n", reg & 0xfcf); } else if (tp->mac_version == RTL_GIGA_MAC_VER_42) { tp->mac_version = tp->supports_gmii ? 
RTL_GIGA_MAC_VER_42 : @@ -6976,27 +6973,23 @@ static const struct rtl_cfg_info { u16 irq_mask; unsigned int has_gmii:1; const struct rtl_coalesce_info *coalesce_info; - u8 default_ver; } rtl_cfg_infos [] = { [RTL_CFG_0] = { .hw_start = rtl_hw_start_8169, .irq_mask = SYSErr | LinkChg | RxOverflow | RxFIFOOver, .has_gmii = 1, .coalesce_info = rtl_coalesce_info_8169, - .default_ver= RTL_GIGA_MAC_VER_01, }, [RTL_CFG_1] = { .hw_start = rtl_hw_start_8168, .irq_mask = LinkChg | RxOverflow, .has_gmii = 1, .coalesce_info = rtl_coalesce_info_8168_8136, - .default_ver= RTL_GIGA_MAC_VER_11, }, [RTL_CFG_2] = { .hw_start = rtl_hw_start_8101, .irq_mask = LinkChg | RxOverflow | RxFIFOOver, .coalesce_info = rtl_coalesce_info_8168_8136, - .default_ver= RTL_GIGA_MAC_VER_13, } }; @@ -7259,7 +7252,9 @@ static int rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent) tp->mmio_addr = pcim_iomap_table(pdev)[region]; /* Identify chip attached to board */ - rtl8169_get_mac_version(tp, cfg->default_ver); + rtl8169_get_mac_version(tp); + if (tp->mac_version == RTL_GIGA_MAC_NONE) + return -ENODEV; if (rtl_tbi_enabled(tp)) { dev_err(&pdev->dev, "TBI fiber mode not supported\n"); -- 2.19.1
[PATCH net-next 1/5] r8169: remove ancient GCC bug workaround in a second place
Remove ancient GCC bug workaround in a second place and factor out rtl_8169_get_txd_opts1. Signed-off-by: Heiner Kallweit --- drivers/net/ethernet/realtek/r8169.c | 25 ++--- 1 file changed, 14 insertions(+), 11 deletions(-) diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c index f5781285a..bef89ba50 100644 --- a/drivers/net/ethernet/realtek/r8169.c +++ b/drivers/net/ethernet/realtek/r8169.c @@ -5840,6 +5840,16 @@ static void rtl8169_tx_timeout(struct net_device *dev) rtl_schedule_task(tp, RTL_FLAG_TASK_RESET_PENDING); } +static __le32 rtl8169_get_txd_opts1(u32 opts0, u32 len, unsigned int entry) +{ + u32 status = opts0 | len; + + if (entry == NUM_TX_DESC - 1) + status |= RingEnd; + + return cpu_to_le32(status); +} + static int rtl8169_xmit_frags(struct rtl8169_private *tp, struct sk_buff *skb, u32 *opts) { @@ -5852,7 +5862,7 @@ static int rtl8169_xmit_frags(struct rtl8169_private *tp, struct sk_buff *skb, for (cur_frag = 0; cur_frag < info->nr_frags; cur_frag++) { const skb_frag_t *frag = info->frags + cur_frag; dma_addr_t mapping; - u32 status, len; + u32 len; void *addr; entry = (entry + 1) % NUM_TX_DESC; @@ -5868,11 +5878,7 @@ static int rtl8169_xmit_frags(struct rtl8169_private *tp, struct sk_buff *skb, goto err_out; } - status = opts[0] | len; - if (entry == NUM_TX_DESC - 1) - status |= RingEnd; - - txd->opts1 = cpu_to_le32(status); + txd->opts1 = rtl8169_get_txd_opts1(opts[0], len, entry); txd->opts2 = cpu_to_le32(opts[1]); txd->addr = cpu_to_le64(mapping); @@ -6068,8 +6074,7 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb, struct TxDesc *txd = tp->TxDescArray + entry; struct device *d = tp_to_dev(tp); dma_addr_t mapping; - u32 status, len; - u32 opts[2]; + u32 opts[2], len; int frags; if (unlikely(!TX_FRAGS_READY_FOR(tp, skb_shinfo(skb)->nr_frags))) { @@ -6118,9 +6123,7 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb, /* Force memory writes to complete before releasing descriptor */ dma_wmb(); 
- /* Anti gcc 2.95.3 bugware (sic) */ - status = opts[0] | len | (RingEnd * !((entry + 1) % NUM_TX_DESC)); - txd->opts1 = cpu_to_le32(status); + txd->opts1 = rtl8169_get_txd_opts1(opts[0], len, entry); /* Force all memory writes to complete before notifying device */ wmb(); -- 2.19.1
[PATCH bpf-next] bpf: Add BPF_MAP_TYPE_QUEUE and BPF_MAP_TYPE_STACK to bpftool-map
I noticed that these two new BPF Maps are not defined in bpftool. This patch defines those two maps and adds their names to the bpftool-map documentation. Signed-off-by: David Calavera --- tools/bpf/bpftool/Documentation/bpftool-map.rst | 3 ++- tools/bpf/bpftool/map.c | 2 ++ 2 files changed, 4 insertions(+), 1 deletion(-) diff --git a/tools/bpf/bpftool/Documentation/bpftool-map.rst b/tools/bpf/bpftool/Documentation/bpftool-map.rst index f55a2daed59b..9e827e342d9e 100644 --- a/tools/bpf/bpftool/Documentation/bpftool-map.rst +++ b/tools/bpf/bpftool/Documentation/bpftool-map.rst @@ -42,7 +42,8 @@ MAP COMMANDS | | **percpu_array** | **stack_trace** | **cgroup_array** | **lru_hash** | | **lru_percpu_hash** | **lpm_trie** | **array_of_maps** | **hash_of_maps** | | **devmap** | **sockmap** | **cpumap** | **xskmap** | **sockhash** -| | **cgroup_storage** | **reuseport_sockarray** | **percpu_cgroup_storage** } +| | **cgroup_storage** | **reuseport_sockarray** | **percpu_cgroup_storage** +| | **queue** | **stack** } DESCRIPTION === diff --git a/tools/bpf/bpftool/map.c b/tools/bpf/bpftool/map.c index 7bf38f0e152e..68b656b6edcc 100644 --- a/tools/bpf/bpftool/map.c +++ b/tools/bpf/bpftool/map.c @@ -74,6 +74,8 @@ static const char * const map_type_name[] = { [BPF_MAP_TYPE_CGROUP_STORAGE] = "cgroup_storage", [BPF_MAP_TYPE_REUSEPORT_SOCKARRAY] = "reuseport_sockarray", [BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE]= "percpu_cgroup_storage", + [BPF_MAP_TYPE_QUEUE] = "queue", + [BPF_MAP_TYPE_STACK] = "stack", }; static bool map_is_per_cpu(__u32 type) -- 2.17.1
[PATCH net-next 0/5] r8169: some functional improvements
This series includes a few functional improvements. Heiner Kallweit (5): r8169: Remove ancient GCC bug workaround in a second place r8169: remove default chip versions r8169: simplify detecting chip versions with same XID r8169: use napi_consume_skb where possible r8169: replace macro TX_FRAGS_READY_FOR with a function drivers/net/ethernet/realtek/r8169.c | 90 +--- 1 file changed, 43 insertions(+), 47 deletions(-) -- 2.19.1
Re: [PATCH bpf] bpf: fix integer overflow in queue_stack_map
On 11/22/2018 07:49 PM, Alexei Starovoitov wrote: > fix the following issues: > - allow queue_stack_map for root only > - fix u32 max_entries overflow > - disallow value_size == 0 > > Reported-by: Wei Wu > Fixes: f1a2e44a3aec ("bpf: add queue and stack maps") > Signed-off-by: Alexei Starovoitov Applied, thanks everyone!
Re: DSA support for Marvell 88e6065 switch
Hi! > > > > If I wanted it to work, what do I need to do? AFAICT phy autoprobing > > > > should just attach it as soon as it is compiled in? > > > > > > Nope. It is a switch, not a PHY. Switches are never auto-probed > > > because they are not guaranteed to have ID registers. > > > > > > You need to use the legacy device tree binding. Look in > > > Documentation/devicetree/bindings/net/dsa/dsa.txt, section Deprecated > > > Binding. You can get more examples if you checkout old kernels. Or > > > kirkwood-rd88f6281.dtsi, the dsa { } node which is disabled. > > > > Thanks; I ported code from mv88e66xx in the meantime, and switch > > appears to be detected. > > > > But I'm running into problems with tagging code, and I guess I'd like > > some help understanding. > > > > tag_trailer: allocates new skb, then copies data around. > > > > tag_qca: does dev->stats.tx_packets++, and reuses existing skb. > > > > tag_brcm: reuses existing skb. Any idea why tag trailer allocates new skb, and what is going on with dev->stats.tx_packets++? > > Is qca wrong in adjusting the statistics? Why does trailer allocate > > new skb? > > > > 6065 seems to use 2-byte header between "SFD" and "Destination > > address" in the ethernet frame. That's ... strange place to put > > header, as addresses are now shifted. I need to put ethernet in > > promisc mode (by running tcpdump) to get data moving.. and can not > > figure out what to do in tag_... > > Does this switch chip not also support trailer mode? > > There's basically four tagging modes for Marvell switch chips: header > mode (the one you described), trailer mode (tag_trailer.c), DSA and > ethertype DSA. 
> The switch chips I worked on that didn't support > (ethertype) DSA tagging did support both header and trailer modes, > and I chose to run them in trailer mode for the reasons you describe > above, but if your chip doesn't support trailer mode, then yes, > you'll have to add support for header mode and put the underlying > interface into promiscuous mode and such. It seems that 6060 supports both header (probably, parts of docs are redacted) and trailer mode... but I'm working with 6065. That does not support trailer mode... or at least the word "trailer" does not appear anywhere in the documentation. What chip were you working with? I may want to take a look at their wording. 6065 indeed has some kind of "egress tagging mode" (with four options), but I have trouble understanding what it really does. Thanks, Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Re: [PATCH net] net: thunderx: set xdp_prog to NULL if bpf_prog_add fails
From: Lorenzo Bianconi Date: Wed, 21 Nov 2018 16:32:10 +0100 > Set xdp_prog pointer to NULL if bpf_prog_add fails since that routine > reports the error code instead of NULL in case of failure and xdp_prog > pointer value is used in the driver to verify if XDP is currently > enabled. > Moreover report the error code to userspace if nicvf_xdp_setup fails > > Fixes: 05c773f52b96 ("net: thunderx: Add basic XDP support") > Signed-off-by: Lorenzo Bianconi Applied and queued up for -stable.
[PATCH v2 bpf-next] bpf: add skb->tstamp r/w access from tc clsact and cg skb progs
This could be used to rate limit egress traffic in concert with a qdisc which supports Earliest Departure Time, such as FQ. Write access from cg skb progs only with CAP_SYS_ADMIN, since the value will be used by downstream qdiscs. It might make sense to relax this. Changes v1 -> v2: - allow access from cg skb, write only with CAP_SYS_ADMIN Signed-off-by: Vlad Dumitrescu --- include/uapi/linux/bpf.h| 1 + net/core/filter.c | 29 + tools/include/uapi/linux/bpf.h | 1 + tools/testing/selftests/bpf/test_verifier.c | 29 + 4 files changed, 60 insertions(+) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index c1554aa074659..23e2031a43d43 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -2468,6 +2468,7 @@ struct __sk_buff { __u32 data_meta; struct bpf_flow_keys *flow_keys; + __u64 tstamp; }; struct bpf_tunnel_key { diff --git a/net/core/filter.c b/net/core/filter.c index f6ca38a7d4332..65dc13aeca7c4 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -5573,6 +5573,10 @@ static bool bpf_skb_is_valid_access(int off, int size, enum bpf_access_type type if (size != sizeof(struct bpf_flow_keys *)) return false; break; + case bpf_ctx_range(struct __sk_buff, tstamp): + if (size != sizeof(__u64)) + return false; + break; default: /* Only narrow read access allowed for now. 
*/ if (type == BPF_WRITE) { @@ -5600,6 +5604,7 @@ static bool sk_filter_is_valid_access(int off, int size, case bpf_ctx_range(struct __sk_buff, data_end): case bpf_ctx_range(struct __sk_buff, flow_keys): case bpf_ctx_range_till(struct __sk_buff, family, local_port): + case bpf_ctx_range(struct __sk_buff, tstamp): return false; } @@ -5638,6 +5643,10 @@ static bool cg_skb_is_valid_access(int off, int size, case bpf_ctx_range(struct __sk_buff, priority): case bpf_ctx_range_till(struct __sk_buff, cb[0], cb[4]): break; + case bpf_ctx_range(struct __sk_buff, tstamp): + if (!capable(CAP_SYS_ADMIN)) + return false; + break; default: return false; } @@ -5665,6 +5674,7 @@ static bool lwt_is_valid_access(int off, int size, case bpf_ctx_range_till(struct __sk_buff, family, local_port): case bpf_ctx_range(struct __sk_buff, data_meta): case bpf_ctx_range(struct __sk_buff, flow_keys): + case bpf_ctx_range(struct __sk_buff, tstamp): return false; } @@ -5874,6 +5884,7 @@ static bool tc_cls_act_is_valid_access(int off, int size, case bpf_ctx_range(struct __sk_buff, priority): case bpf_ctx_range(struct __sk_buff, tc_classid): case bpf_ctx_range_till(struct __sk_buff, cb[0], cb[4]): + case bpf_ctx_range(struct __sk_buff, tstamp): break; default: return false; @@ -6093,6 +6104,7 @@ static bool sk_skb_is_valid_access(int off, int size, case bpf_ctx_range(struct __sk_buff, tc_classid): case bpf_ctx_range(struct __sk_buff, data_meta): case bpf_ctx_range(struct __sk_buff, flow_keys): + case bpf_ctx_range(struct __sk_buff, tstamp): return false; } @@ -6179,6 +6191,7 @@ static bool flow_dissector_is_valid_access(int off, int size, case bpf_ctx_range(struct __sk_buff, tc_classid): case bpf_ctx_range(struct __sk_buff, data_meta): case bpf_ctx_range_till(struct __sk_buff, family, local_port): + case bpf_ctx_range(struct __sk_buff, tstamp): return false; } @@ -6488,6 +6501,22 @@ static u32 bpf_convert_ctx_access(enum bpf_access_type type, *insn++ = BPF_LDX_MEM(BPF_SIZEOF(void *), si->dst_reg, 
si->src_reg, off); break; + + case offsetof(struct __sk_buff, tstamp): + BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, tstamp) != 8); + + if (type == BPF_WRITE) + *insn++ = BPF_STX_MEM(BPF_DW, + si->dst_reg, si->src_reg, + bpf_target_off(struct sk_buff, +tstamp, 8, +target_size)); + else + *insn++ = BPF_LDX_MEM(BPF_DW, + si->dst_reg, si->src_reg, + bpf_target_off(struct
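The cover letter above motivates the new `__sk_buff` tstamp field with Earliest Departure Time pacing under an EDT-aware qdisc such as FQ. As a rough illustration of the arithmetic such a pacer performs — a minimal user-space sketch, not kernel or BPF API, with all names invented here:

```c
#include <stdint.h>

/* Earliest-departure-time sketch: space packets len/rate apart by giving
 * each one a departure timestamp; an EDT-aware qdisc (e.g. FQ) would hold
 * each packet until its timestamp. Illustrative model only. */
static uint64_t edt_next_tstamp(uint64_t prev_tstamp_ns, uint64_t now_ns,
                                uint32_t pkt_len,
                                uint64_t rate_bytes_per_sec)
{
    uint64_t delay_ns = (uint64_t)pkt_len * 1000000000ull /
                        rate_bytes_per_sec;
    uint64_t next = prev_tstamp_ns + delay_ns;

    return next > now_ns ? next : now_ns; /* never schedule in the past */
}
```

A program writing `skb->tstamp` this way effectively rate limits a flow without dropping: packets simply leave later.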
Re: [PATCH net-next] {net,IB}/mlx4: Initialize CQ buffers in the driver when possible
From: Tariq Toukan Date: Wed, 21 Nov 2018 17:12:05 +0200 > From: Daniel Jurgens > > Perform CQ initialization in the driver when the capability is supported > by the FW. When passing the CQ to HW indicate that the CQ buffer has > been pre-initialized. > > Doing so decreases CQ creation time. Testing on P8 showed a single 2048 > entry CQ creation time was reduced from ~395us to ~170us, which is > 2.3x faster. > > Signed-off-by: Daniel Jurgens > Signed-off-by: Jack Morgenstein > Signed-off-by: Tariq Toukan Applied.
Re: [PATCH net] net/dim: Update DIM start sample after each DIM iteration
From: Tal Gilboa Date: Wed, 21 Nov 2018 16:28:23 +0200 > On every iteration of net_dim, the algorithm may choose to > check for the system state by comparing current data sample > with previous data sample. After each of these comparison, > regardless of the action taken, the sample used as baseline > is needed to be updated. > > This patch fixes a bug that causes DIM to take wrong decisions, > due to never updating the baseline sample for comparison between > iterations. This way, DIM always compares current sample with > zeros. > > Although this is a functional fix, it also improves and stabilizes > performance as the algorithm works properly now. > > Performance: > Tested single UDP TX stream with pktgen: > samples/pktgen/pktgen_sample03_burst_single_flow.sh -i p4p2 -d 1.1.1.1 > -m 24:8a:07:88:26:8b -f 3 -b 128 > > ConnectX-5 100GbE packet rate improved from 15-19Mpps to 19-20Mpps. > Also, toggling between profiles is less frequent with the fix. > > Fixes: 8115b750dbcb ("net/dim: use struct net_dim_sample as arg to net_dim") > Signed-off-by: Tal Gilboa > Reviewed-by: Tariq Toukan Applied and queued up for -stable.
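The bug described above — never re-baselining, so every iteration compares against a zero-initialized sample — can be modelled in a few lines. This is an illustrative user-space sketch, not the kernel's net_dim structures or names:

```c
/* Model of the net_dim baseline bug: deltas are computed against a "start"
 * sample. Without re-baselining after each iteration, every delta is taken
 * against the zero-initialized sample and looks like an absolute counter. */
struct sample { unsigned long pkts; };

static unsigned long dim_delta(struct sample *start, const struct sample *cur,
                               int update_baseline)
{
    unsigned long delta = cur->pkts - start->pkts;

    if (update_baseline)
        *start = *cur; /* the fix: baseline becomes the current sample */
    return delta;
}
```

With the fix, successive deltas reflect per-iteration change, which is what the algorithm's state comparison expects.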
Re: [PATCH net-next] selftests: explicitly require kernel features needed by udpgro tests
From: Paolo Abeni Date: Wed, 21 Nov 2018 14:31:15 +0100 > commit 3327a9c46352f1 ("selftests: add functionals test for UDP GRO") > make use of ipv6 NAT, but such a feature is not currently implied by > selftests. Since the 'ip[6]tables' commands may actually create nft rules, > depending on the specific user-space version, let's pull both NF and > NFT nat modules plus the needed deps. > > Reported-by: Naresh Kamboju > Fixes: 3327a9c46352f1 ("selftests: add functionals test for UDP GRO") > Signed-off-by: Paolo Abeni Applied.
[PATCH bpf] bpf: fix integer overflow in queue_stack_map
fix the following issues: - allow queue_stack_map for root only - fix u32 max_entries overflow - disallow value_size == 0 Reported-by: Wei Wu Fixes: f1a2e44a3aec ("bpf: add queue and stack maps") Signed-off-by: Alexei Starovoitov --- kernel/bpf/queue_stack_maps.c | 16 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/kernel/bpf/queue_stack_maps.c b/kernel/bpf/queue_stack_maps.c index 8bbd72d3a121..b384ea9f3254 100644 --- a/kernel/bpf/queue_stack_maps.c +++ b/kernel/bpf/queue_stack_maps.c @@ -7,6 +7,7 @@ #include #include #include +#include #include "percpu_freelist.h" #define QUEUE_STACK_CREATE_FLAG_MASK \ @@ -45,8 +46,12 @@ static bool queue_stack_map_is_full(struct bpf_queue_stack *qs) /* Called from syscall */ static int queue_stack_map_alloc_check(union bpf_attr *attr) { + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; + /* check sanity of attributes */ if (attr->max_entries == 0 || attr->key_size != 0 || + attr->value_size == 0 || attr->map_flags & ~QUEUE_STACK_CREATE_FLAG_MASK) return -EINVAL; @@ -63,15 +68,10 @@ static struct bpf_map *queue_stack_map_alloc(union bpf_attr *attr) { int ret, numa_node = bpf_map_attr_numa_node(attr); struct bpf_queue_stack *qs; - u32 size, value_size; - u64 queue_size, cost; - - size = attr->max_entries + 1; - value_size = attr->value_size; - - queue_size = sizeof(*qs) + (u64) value_size * size; + u64 size, queue_size, cost; - cost = queue_size; + size = (u64) attr->max_entries + 1; + cost = queue_size = sizeof(*qs) + size * attr->value_size; if (cost >= U32_MAX - PAGE_SIZE) return ERR_PTR(-E2BIG); -- 2.17.1
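The u32 overflow fixed above is easy to demonstrate in isolation: with 32-bit arithmetic, `max_entries + 1` wraps to 0 when `max_entries == U32_MAX`, so the computed allocation size collapses while the map still accepts `U32_MAX` elements. A minimal sketch of the before/after size computation (simplified from the patch):

```c
#include <stdint.h>

/* Buggy shape: the +1 happens in u32 and can wrap to 0. */
static uint64_t queue_size_u32(uint32_t max_entries, uint32_t value_size)
{
    uint32_t size = max_entries + 1;           /* wraps at U32_MAX */
    return (uint64_t)value_size * size;
}

/* Fixed shape: widen before the arithmetic, as the patch does. */
static uint64_t queue_size_u64(uint32_t max_entries, uint32_t value_size)
{
    uint64_t size = (uint64_t)max_entries + 1; /* cannot wrap */
    return size * value_size;
}
```

An undersized allocation paired with an unchanged element limit is exactly the heap-overflow condition the patch closes.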
Re: [PATCH v3 net-next 04/12] net: ethernet: Use phy_set_max_speed() to limit advertised speed
On Thu, Nov 22, 2018 at 12:40:25PM +0200, Anssi Hannula wrote: > Hi, > > On 12.9.2018 2:53, Andrew Lunn wrote: > > Many Ethernet MAC drivers want to limit the PHY to only advertise a > > maximum speed of 100Mbs or 1Gbps. Rather than using a mask, make use > > of the helper function phy_set_max_speed(). > > But what if the PHY does not support 1Gbps in the first place? Yes, you are correct. __set_phy_supported() needs modifying to take into account what the PHY can do. Thanks for pointing this out. I will take a look. Andrew
Re: [PATCH net-next 4/4] octeontx2-af: Bringup CGX LMAC links by default
On Thu, Nov 22, 2018 at 05:18:37PM +0530, Linu Cherian wrote: > From: Linu Cherian > > - Added new CGX firmware interface API for sending link up/down > commands > > - Do link up for cgx lmac ports by default at the time of CGX > driver probe. Hi Linu This is a complex driver which I don't understand... By link up, do you mean the equivalent of 'ip link set up dev ethX'? Andrew
Re: [PATCH net-next 2/3] tcp: implement coalescing on backlog queue
On Thu, Nov 22, 2018 at 10:16 AM Eric Dumazet wrote: > Yes, I was considering properly filtering SACK as a refinement later [1] > but you raise a valid point for alien stacks that are not yet using SACK :/ > > [1] This version of the patch will not aggregate sacks since the > memcmp() on tcp options would fail. > > Neal can you double check if cake_ack_filter() does not have the issue > you just mentioned ? Note that aggregated pure acks will have a gso_segs set to the number of aggregated acks, we might simply use this value later in the stack, instead of forcing having X pure acks in the backlog and increase memory needs and cpu costs. Then I guess I need this fix : diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index 36c9d715bf2aa7eb7bf58b045bfeb85a2ec1a696..736f7f24cdb4fe61769faaa1644c8bff01c746c4 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -1669,7 +1669,8 @@ bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb) __skb_pull(skb, hdrlen); if (skb_try_coalesce(tail, skb, &fragstolen, &delta)) { TCP_SKB_CB(tail)->end_seq = TCP_SKB_CB(skb)->end_seq; - TCP_SKB_CB(tail)->ack_seq = TCP_SKB_CB(skb)->ack_seq; + if (after(TCP_SKB_CB(skb)->ack_seq, TCP_SKB_CB(tail)->ack_seq)) + TCP_SKB_CB(tail)->ack_seq = TCP_SKB_CB(skb)->ack_seq; TCP_SKB_CB(tail)->tcp_flags |= TCP_SKB_CB(skb)->tcp_flags; if (TCP_SKB_CB(skb)->has_rxtstamp) {
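The fix above keeps the larger of the two ack numbers using `after()`, which compares 32-bit TCP sequence numbers via signed difference so that the comparison survives wraparound. A small stand-alone model of that comparison (modelled on the kernel's serial-number arithmetic, not copied from it):

```c
#include <stdint.h>

/* after(a, b): is sequence number a later than b, modulo 2^32?
 * The signed cast makes values just past the wrap point compare as
 * "later" than values just before it, which a plain > would get wrong. */
static int after(uint32_t seq1, uint32_t seq2)
{
    return (int32_t)(seq1 - seq2) > 0;
}
```

This is why a coalesced backlog skb can safely absorb an older duplicate ack without moving its ack_seq backwards.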
Re: [PATCH net-next 2/3] tcp: implement coalescing on backlog queue
On Thu, Nov 22, 2018 at 10:01 AM Neal Cardwell wrote: > > On Wed, Nov 21, 2018 at 12:52 PM Eric Dumazet wrote: > > > > In case GRO is not as efficient as it should be or disabled, > > we might have a user thread trapped in __release_sock() while > > softirq handler flood packets up to the point we have to drop. > > > > This patch balances work done from user thread and softirq, > > to give more chances to __release_sock() to complete its work. > > > > This also helps if we receive many ACK packets, since GRO > > does not aggregate them. > > Would this coalesce duplicate incoming ACK packets? Is there a risk > that this would eliminate incoming dupacks needed for fast recovery in > non-SACK connections? Perhaps pure ACKs should only be coalesced if > the ACK field is different? Yes, I was considering properly filtering SACK as a refinement later [1] but you raise a valid point for alien stacks that are not yet using SACK :/ [1] This version of the patch will not aggregate sacks since the memcmp() on tcp options would fail. Neal can you double check if cake_ack_filter() does not have the issue you just mentioned ?
Re: [PATCH v1 net] lan743x: Enable driver to work with LAN7431
On Wed, Nov 21, 2018 at 02:22:45PM -0500, Bryan Whitehead wrote: > This driver was designed to work with both LAN7430 and LAN7431. > The only difference between the two is the LAN7431 has support > for external phy. > > This change adds LAN7431 to the list of recognized devices > supported by this driver. > > fixes: driver won't load for LAN7431 Hi Bryan There is a well defined format for Fixes:. Fixes: 23f0703c125b ("lan743x: Add main source files for new lan743x driver") Andrew
Re: [RFC v4 4/5] netdev: add netdev_is_upper_master
Le 22 nov. 2018 à 18:14, David Ahern a écrit : > On 11/21/18 6:07 PM, Alexis Bauvin wrote: >> diff --git a/net/core/dev.c b/net/core/dev.c >> index 93243479085f..12459036d0da 100644 >> --- a/net/core/dev.c >> +++ b/net/core/dev.c >> @@ -7225,6 +7225,23 @@ void netdev_lower_state_changed(struct net_device >> *lower_dev, >> } >> EXPORT_SYMBOL(netdev_lower_state_changed); >> >> +/** >> + * netdev_is_upper_master - Test if a device is a master, direct or >> indirect, >> + * of another one. >> + * @dev: device to start looking from >> + * @master: device to test if master of dev >> + */ >> +bool netdev_is_upper_master(struct net_device *dev, struct net_device >> *master) >> +{ >> +if (!dev) >> +return false; >> + >> +if (dev->ifindex == master->ifindex) > > dev == master should work as well without the dereference. Ack, will add to next version. >> +return true; >> +return netdev_is_upper_master(netdev_master_upper_dev_get(dev), master); >> +} >> +EXPORT_SYMBOL(netdev_is_upper_master); >> + >> static void dev_change_rx_flags(struct net_device *dev, int flags) >> { >> const struct net_device_ops *ops = dev->netdev_ops; >>
Re: [PATCH net-next 2/3] tcp: implement coalescing on backlog queue
On Wed, Nov 21, 2018 at 12:52 PM Eric Dumazet wrote: > > In case GRO is not as efficient as it should be or disabled, > we might have a user thread trapped in __release_sock() while > softirq handler flood packets up to the point we have to drop. > > This patch balances work done from user thread and softirq, > to give more chances to __release_sock() to complete its work. > > This also helps if we receive many ACK packets, since GRO > does not aggregate them. Would this coalesce duplicate incoming ACK packets? Is there a risk that this would eliminate incoming dupacks needed for fast recovery in non-SACK connections? Perhaps pure ACKs should only be coalesced if the ACK field is different? neal
Re: [RFC v4 3/5] vxlan: add support for underlay in non-default VRF
On 11/21/18 6:07 PM, Alexis Bauvin wrote: > Creating a VXLAN device with is underlay in the non-default VRF makes > egress route lookup fail or incorrect since it will resolve in the > default VRF, and ingress fail because the socket listens in the default > VRF. > > This patch binds the underlying UDP tunnel socket to the l3mdev of the > lower device of the VXLAN device. This will listen in the proper VRF and > output traffic from said l3mdev, matching l3mdev routing rules and > looking up the correct routing table. > > When the VXLAN device does not have a lower device, or the lower device > is in the default VRF, the socket will not be bound to any interface, > keeping the previous behaviour. > > The underlay l3mdev is deduced from the VXLAN lower device > (IFLA_VXLAN_LINK). > > +--+ +-+ > | | | | > | vrf-blue | | vrf-red | > | | | | > ++-+ +++ > || > || > ++-+ +++ > | | | | > | br-blue | | br-red | > | | | | > ++-+ +---+-+---+ > | | | > | +-+ +-+ > | | | > ++-++--++ +++ > | | lower device | | | | > | eth0 | <- - - - - - - | vxlan-red | | tap-red | (... more taps) > | || | | | > +--++---+ +-+ > > Signed-off-by: Alexis Bauvin > Reviewed-by: Amine Kherbouche > Tested-by: Amine Kherbouche > --- > drivers/net/vxlan.c | 32 +-- > .../selftests/net/test_vxlan_under_vrf.sh | 90 +++ > 2 files changed, 114 insertions(+), 8 deletions(-) > create mode 100755 tools/testing/selftests/net/test_vxlan_under_vrf.sh > Reviewed-by: David Ahern Thanks for adding the test case; I'll try it out next week (after the holidays).
Re: [RFC v4 4/5] netdev: add netdev_is_upper_master
On 11/21/18 6:07 PM, Alexis Bauvin wrote: > diff --git a/net/core/dev.c b/net/core/dev.c > index 93243479085f..12459036d0da 100644 > --- a/net/core/dev.c > +++ b/net/core/dev.c > @@ -7225,6 +7225,23 @@ void netdev_lower_state_changed(struct net_device > *lower_dev, > } > EXPORT_SYMBOL(netdev_lower_state_changed); > > +/** > + * netdev_is_upper_master - Test if a device is a master, direct or indirect, > + * of another one. > + * @dev: device to start looking from > + * @master: device to test if master of dev > + */ > +bool netdev_is_upper_master(struct net_device *dev, struct net_device > *master) > +{ > + if (!dev) > + return false; > + > + if (dev->ifindex == master->ifindex) dev == master should work as well without the dereference. > + return true; > + return netdev_is_upper_master(netdev_master_upper_dev_get(dev), master); > +} > +EXPORT_SYMBOL(netdev_is_upper_master); > + > static void dev_change_rx_flags(struct net_device *dev, int flags) > { > const struct net_device_ops *ops = dev->netdev_ops; >
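The recursion being reviewed above walks each device's single master pointer until it reaches the candidate or runs off the top of the chain. A user-space model of that walk, incorporating the reviewer's suggestion to compare pointers directly rather than dereferencing ifindex (struct and names are illustrative, not the kernel's):

```c
#include <stddef.h>

struct dev { struct dev *master; }; /* stand-in for net_device */

static int is_upper_master(const struct dev *dev, const struct dev *master)
{
    if (!dev)
        return 0;          /* walked past the top of the chain */
    if (dev == master)
        return 1;          /* pointer compare, no dereference needed */
    return is_upper_master(dev->master, master);
}
```

For a chain like eth0 -> br-blue -> vrf-blue, this resolves vrf-blue as an (indirect) master of eth0 but not the reverse.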
Re: [RFC v4 2/5] l3mdev: add function to retreive upper master
On 11/21/18 6:07 PM, Alexis Bauvin wrote: > Existing functions to retreive the l3mdev of a device did not walk the > master chain to find the upper master. This patch adds a function to > find the l3mdev, even indirect through e.g. a bridge: > > ... > > This will properly resolve the l3mdev of eth0 to vrf-blue. > > Signed-off-by: Alexis Bauvin > Reviewed-by: Amine Kherbouche > Tested-by: Amine Kherbouche > --- > include/net/l3mdev.h | 22 ++ > net/l3mdev/l3mdev.c | 18 ++ > 2 files changed, 40 insertions(+) > Reviewed-by: David Ahern
Re: [RFC v4 1/5] udp_tunnel: add config option to bind to a device
On 11/21/18 6:07 PM, Alexis Bauvin wrote: > UDP tunnel sockets are always opened unbound to a specific device. This > patch allow the socket to be bound on a custom device, which > incidentally makes UDP tunnels VRF-aware if binding to an l3mdev. > > Signed-off-by: Alexis Bauvin > Reviewed-by: Amine Kherbouche > Tested-by: Amine Kherbouche > --- > include/net/udp_tunnel.h | 1 + > net/ipv4/udp_tunnel.c | 10 ++ > net/ipv6/ip6_udp_tunnel.c | 9 + > 3 files changed, 20 insertions(+) Reviewed-by: David Ahern
patchwork bug?
Not sure if it's the right place to post that. When I try to list patches with filters, something like this: http://patchwork.ozlabs.org/project/netdev/list/?series==2036=*==both=34 I can see only page 1. When I click on '2', page 1 is still displayed and the page numbering is removed. Regards, Nicolas
Re: consistency for statistics with XDP mode
David Ahern writes: > On 11/22/18 1:26 AM, Toke Høiland-Jørgensen wrote: >> Saeed Mahameed writes: >> > I'd say it sounds reasonable to include XDP in the normal traffic > counters, but having the detailed XDP-specific counters is quite > useful > as well... So can't we do both (for all drivers)? > >>> >>> What are you thinking ? >>> reporting XDP_DROP in interface dropped counter ? >>> and XDP_TX/REDIRECT in the TX counter ? >>> XDP_ABORTED in the err/drop counter ? >>> >>> how about having a special XDP command in the .ndo_bpf that would query >>> the standardized XDP stats ? >> >> Don't have any strong opinions on the mechanism; just pointing out that >> the XDP-specific stats are useful to have separately as well :) >> > > I would like to see basic packets, bytes, and dropped counters tracked > for Rx and Tx via the standard netdev counters for all devices. This is > for ease in accounting as well as speed and simplicity for bumping > counters for virtual devices from bpf helpers. > > From there, the XDP ones can be in the driver private stats as they are > currently but with some consistency across drivers for redirects, drops, > any thing else. > > So not a radical departure from where we are today, just getting the > agreement for consistency and driver owners to make the changes. Sounds good to me :) -Toke
Re: [PATCH net-next,v3 12/12] qede: use ethtool_rx_flow_rule() to remove duplicated parser code
On Wed, Nov 21, 2018 at 03:51:32AM +0100, Pablo Neira Ayuso wrote: ... > static int > qede_parse_flower_attr(struct qede_dev *edev, __be16 proto, > -struct tc_cls_flower_offload *f, > -struct qede_arfs_tuple *tuple) > +struct flow_rule *rule, struct qede_arfs_tuple *tuple) What about s/qede_parse_flower_attr/qede_parse_flow_attr/ or so? As it is not about flower anymore. It also helps here: > -static int qede_flow_spec_to_tuple(struct qede_dev *edev, > -struct qede_arfs_tuple *t, > -struct ethtool_rx_flow_spec *fs) > +static int qede_flow_spec_to_rule(struct qede_dev *edev, > + struct qede_arfs_tuple *t, > + struct ethtool_rx_flow_spec *fs) > { ... > + > + if (qede_parse_flower_attr(edev, proto, flow->rule, t)) { > + err = -EINVAL; > + goto err_out; > + } > + > + /* Make sure location is valid and filter isn't already set */ > + err = qede_flow_spec_validate(edev, >rule->action, t, > + fs->location); ...
Re: consistency for statistics with XDP mode
On 11/22/18 1:26 AM, Toke Høiland-Jørgensen wrote: > Saeed Mahameed writes: > I'd say it sounds reasonable to include XDP in the normal traffic counters, but having the detailed XDP-specific counters is quite useful as well... So can't we do both (for all drivers)? >> >> What are you thinking ? >> reporting XDP_DROP in interface dropped counter ? >> and XDP_TX/REDIRECT in the TX counter ? >> XDP_ABORTED in the err/drop counter ? >> >> how about having a special XDP command in the .ndo_bpf that would query >> the standardized XDP stats ? > > Don't have any strong opinions on the mechanism; just pointing out that > the XDP-specific stats are useful to have separately as well :) > I would like to see basic packets, bytes, and dropped counters tracked for Rx and Tx via the standard netdev counters for all devices. This is for ease in accounting as well as speed and simplicity for bumping counters for virtual devices from bpf helpers. From there, the XDP ones can be in the driver private stats as they are currently but with some consistency across drivers for redirects, drops, any thing else. So not a radical departure from where we are today, just getting the agreement for consistency and driver owners to make the changes.
Re: consistency for statistics with XDP mode
On 11/21/18 5:53 PM, Toshiaki Makita wrote: >> We really need consistency in the counters and at a minimum, users >> should be able to track packet and byte counters for both Rx and Tx >> including XDP. >> >> It seems to me the Rx and Tx packet, byte and dropped counters returned >> for the standard device stats (/proc/net/dev, ip -s li show, ...) should >> include all packets managed by the driver regardless of whether they are >> forwarded / dropped in XDP or go up the Linux stack. This also aligns > > Agreed. When I introduced virtio_net XDP counters, I just forgot to > update tx packets/bytes counters on ndo_xdp_xmit. Probably I thought it > is handled by free_old_xmit_skbs. Do you have some time to look at adding the Tx counters to virtio_net?
Re: [PATCH net-next v2 5/5] netns: enable to dump full nsid translation table
Le 22/11/2018 à 17:40, David Ahern a écrit : > On 11/22/18 8:50 AM, Nicolas Dichtel wrote: >> diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c >> index dd25fb22ad45..25030e0317a2 100644 >> --- a/net/core/net_namespace.c >> +++ b/net/core/net_namespace.c >> @@ -745,6 +745,8 @@ struct net_fill_args { >> int flags; >> int cmd; >> int nsid; >> +bool add_ref; >> +int ref_nsid; >> }; >> >> static int rtnl_net_fill(struct sk_buff *skb, struct net_fill_args *args) >> @@ -763,6 +765,10 @@ static int rtnl_net_fill(struct sk_buff *skb, struct >> net_fill_args *args) >> if (nla_put_s32(skb, NETNSA_NSID, args->nsid)) >> goto nla_put_failure; >> >> +if (args->add_ref && >> +nla_put_s32(skb, NETNSA_CURRENT_NSID, args->ref_nsid)) >> +goto nla_put_failure; >> + > > you need to add NETNSA_CURRENT_NSID to rtnl_net_get_size. > Good catch. I thought to this and I forgot at the end :/
Re: [PATCH net-next v2 5/5] netns: enable to dump full nsid translation table
On 11/22/18 8:50 AM, Nicolas Dichtel wrote: > diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c > index dd25fb22ad45..25030e0317a2 100644 > --- a/net/core/net_namespace.c > +++ b/net/core/net_namespace.c > @@ -745,6 +745,8 @@ struct net_fill_args { > int flags; > int cmd; > int nsid; > + bool add_ref; > + int ref_nsid; > }; > > static int rtnl_net_fill(struct sk_buff *skb, struct net_fill_args *args) > @@ -763,6 +765,10 @@ static int rtnl_net_fill(struct sk_buff *skb, struct > net_fill_args *args) > if (nla_put_s32(skb, NETNSA_NSID, args->nsid)) > goto nla_put_failure; > > + if (args->add_ref && > + nla_put_s32(skb, NETNSA_CURRENT_NSID, args->ref_nsid)) > + goto nla_put_failure; > + you need to add NETNSA_CURRENT_NSID to rtnl_net_get_size.
[PATCH bpf] bpf: Fix integer overflow in queue_stack_map_alloc.
Integer overflow in queue_stack_map_alloc when calculating size may lead to heap overflow of arbitrary length. The patch fix it by checking whether attr->max_entries+1 < attr->max_entries and bailing out if it is the case. The vulnerability is discovered with the assistance of syzkaller. Reported-by: Wei Wu To: Alexei Starovoitov Cc: Daniel Borkmann Cc: netdev Cc: Eric Dumazet Cc: Greg KH Signed-off-by: Wei Wu --- kernel/bpf/queue_stack_maps.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/kernel/bpf/queue_stack_maps.c b/kernel/bpf/queue_stack_maps.c index 8bbd72d3a121..c35a8a4721c8 100644 --- a/kernel/bpf/queue_stack_maps.c +++ b/kernel/bpf/queue_stack_maps.c @@ -67,6 +67,8 @@ static struct bpf_map *queue_stack_map_alloc(union bpf_attr *attr) u64 queue_size, cost; size = attr->max_entries + 1; + if (size < attr->max_entries) + return ERR_PTR(-EINVAL); value_size = attr->value_size; queue_size = sizeof(*qs) + (u64) value_size * size; -- 2.17.1
Re: [PATCH net-next 2/3] tcp: implement coalescing on backlog queue
On Wed, Nov 21, 2018 at 2:40 PM, Eric Dumazet wrote: > > > On 11/21/2018 02:31 PM, Yuchung Cheng wrote: >> On Wed, Nov 21, 2018 at 9:52 AM, Eric Dumazet wrote: > >>> + >> Really nice! would it make sense to re-use (some of) the similar >> tcp_try_coalesce()? >> > > Maybe, but it is a bit complex, since skbs in receive queues (regular or out > of order) > are accounted differently (they have skb->destructor set) > > Also they had the TCP header pulled already, while the backlog coalescing > also has > to make sure TCP options match. > > Not sure if we want to add extra parameters and conditional checks... Makes sense. Acked-by: Yuchung Cheng > >
Re: [PATCH net-next,v3 04/12] cls_api: add translator to flow_action representation
On Wed, Nov 21, 2018 at 03:51:24AM +0100, Pablo Neira Ayuso wrote: ... > +int tc_setup_flow_action(struct flow_action *flow_action, > + const struct tcf_exts *exts) > +{ > + const struct tc_action *act; > + int i, j, k; > + > + if (!exts) > + return 0; > + > + j = 0; > + tcf_exts_for_each_action(i, act, exts) { > + struct flow_action_entry *key; ^ ^^^ > + > + key = _action->entries[j]; ^^^ ^^^ Considering previous changes, what about a s/key/entry/ in the variable name here too? > + if (is_tcf_gact_ok(act)) { > + key->id = FLOW_ACTION_ACCEPT; > + } else if (is_tcf_gact_shot(act)) { > + key->id = FLOW_ACTION_DROP; > + } else if (is_tcf_gact_trap(act)) { > + key->id = FLOW_ACTION_TRAP; > + } else if (is_tcf_gact_goto_chain(act)) { > + key->id = FLOW_ACTION_GOTO; > + key->chain_index = tcf_gact_goto_chain_index(act); > + } else if (is_tcf_mirred_egress_redirect(act)) { > + key->id = FLOW_ACTION_REDIRECT; > + key->dev = tcf_mirred_dev(act); > + } else if (is_tcf_mirred_egress_mirror(act)) { > + key->id = FLOW_ACTION_MIRRED; > + key->dev = tcf_mirred_dev(act); > + } else if (is_tcf_vlan(act)) { > + switch (tcf_vlan_action(act)) { > + case TCA_VLAN_ACT_PUSH: > + key->id = FLOW_ACTION_VLAN_PUSH; > + key->vlan.vid = tcf_vlan_push_vid(act); > + key->vlan.proto = tcf_vlan_push_proto(act); > + key->vlan.prio = tcf_vlan_push_prio(act); > + break; > + case TCA_VLAN_ACT_POP: > + key->id = FLOW_ACTION_VLAN_POP; > + break; > + case TCA_VLAN_ACT_MODIFY: > + key->id = FLOW_ACTION_VLAN_MANGLE; > + key->vlan.vid = tcf_vlan_push_vid(act); > + key->vlan.proto = tcf_vlan_push_proto(act); > + key->vlan.prio = tcf_vlan_push_prio(act); > + break; > + default: > + goto err_out; > + } > + } else if (is_tcf_tunnel_set(act)) { > + key->id = FLOW_ACTION_TUNNEL_ENCAP; > + key->tunnel = tcf_tunnel_info(act); > + } else if (is_tcf_tunnel_release(act)) { > + key->id = FLOW_ACTION_TUNNEL_DECAP; > + key->tunnel = tcf_tunnel_info(act); > + } else if (is_tcf_pedit(act)) { > + for (k = 0; k < 
tcf_pedit_nkeys(act); k++) { > + switch (tcf_pedit_cmd(act, k)) { > + case TCA_PEDIT_KEY_EX_CMD_SET: > + key->id = FLOW_ACTION_MANGLE; > + break; > + case TCA_PEDIT_KEY_EX_CMD_ADD: > + key->id = FLOW_ACTION_ADD; > + break; > + default: > + goto err_out; > + } > + key->mangle.htype = tcf_pedit_htype(act, k); > + key->mangle.mask = tcf_pedit_mask(act, k); > + key->mangle.val = tcf_pedit_val(act, k); > + key->mangle.offset = tcf_pedit_offset(act, k); > + key = _action->entries[++j]; > + } > + } else if (is_tcf_csum(act)) { > + key->id = FLOW_ACTION_CSUM; > + key->csum_flags = tcf_csum_update_flags(act); > + } else if (is_tcf_skbedit_mark(act)) { > + key->id = FLOW_ACTION_MARK; > + key->mark = tcf_skbedit_mark(act); > + } else { > + goto err_out; > + } > + > + if (!is_tcf_pedit(act)) > + j++; > + } > + return 0; > +err_out: > + return -EOPNOTSUPP; > +} > +EXPORT_SYMBOL(tc_setup_flow_action);
Re: [PATCH net-next v2 3/5] netns: add support of NETNSA_TARGET_NSID
On 11/22/18 8:50 AM, Nicolas Dichtel wrote: > Like it was done for link and address, add the ability to perform get/dump > in another netns by specifying a target nsid attribute. > > Signed-off-by: Nicolas Dichtel > --- > include/uapi/linux/net_namespace.h | 1 + > net/core/net_namespace.c | 86 ++ > 2 files changed, 76 insertions(+), 11 deletions(-) Reviewed-by: David Ahern
Re: [PATCH net-next v2 4/5] netns: enable to specify a nsid for a get request
On 11/22/18 8:50 AM, Nicolas Dichtel wrote: > Combined with NETNSA_TARGET_NSID, it enables to "translate" a nsid from one > netns to a nsid of another netns. > This is useful when using NETLINK_F_LISTEN_ALL_NSID because it helps the > user to interpret a nsid received from an other netns. > > Signed-off-by: Nicolas Dichtel > --- > net/core/net_namespace.c | 5 + > 1 file changed, 5 insertions(+) > Reviewed-by: David Ahern
Re: [PATCH net-next v2 2/5] netns: introduce 'struct net_fill_args'
On 11/22/18 8:50 AM, Nicolas Dichtel wrote: > This is a preparatory work. To avoid having to much arguments for the > function rtnl_net_fill(), a new structure is defined. > > Signed-off-by: Nicolas Dichtel > --- > net/core/net_namespace.c | 48 > 1 file changed, 34 insertions(+), 14 deletions(-) > Reviewed-by: David Ahern
Re: [PATCH net-next,v3 00/12] add flow_rule infrastructure
On Wed, Nov 21, 2018 at 03:51:20AM +0100, Pablo Neira Ayuso wrote: > Hi, > > This patchset is the third iteration [1] [2] [3] to introduce a kernel > intermediate (IR) to express ACL hardware offloads. On v2 cover letter you had: """ However, cost of this layer is very small, adding 1 million rules via tc -batch, perf shows: 0.06% tc [kernel.vmlinux][k] tc_setup_flow_action """ The above doesn't include time spent on children calls and I'm worried about the new allocation done by flow_rule_alloc(), as it can impact rule insertion rate. I'll run some tests here and report back.
Re: [PATCH net-next v2 1/5] netns: remove net arg from rtnl_net_fill()
On 11/22/18 8:50 AM, Nicolas Dichtel wrote: > This argument is not used anymore. > > Fixes: cab3c8ec8d57 ("netns: always provide the id to rtnl_net_fill()") > Signed-off-by: Nicolas Dichtel > --- > net/core/net_namespace.c | 10 -- > 1 file changed, 4 insertions(+), 6 deletions(-) > Reviewed-by: David Ahern
Re: [PATCH bpf] bpf: Fix integer overflow in queue_stack_map_alloc.
On Thu, Nov 22, 2018 at 11:59:02PM +0800, Wei Wu wrote: > Integer overflow in queue_stack_map_alloc when calculating size may > lead to heap overflow of arbitrary length. > The patch fix it by checking whether attr->max_entries+1 < > attr->max_entries and bailing out if it is the case. > The vulnerability is discovered with the assistance of syzkaller. > > Reported-by: Wei Wu > To: Alexei Starovoitov > Cc: Daniel Borkmann > Cc: netdev > Cc: Eric Dumazet > Cc: Greg KH > Signed-off-by: Wei Wu > --- > kernel/bpf/queue_stack_maps.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/kernel/bpf/queue_stack_maps.c b/kernel/bpf/queue_stack_maps.c > index 8bbd72d3a121..c35a8a4721c8 100644 > --- a/kernel/bpf/queue_stack_maps.c > +++ b/kernel/bpf/queue_stack_maps.c > @@ -67,6 +67,8 @@ static struct bpf_map *queue_stack_map_alloc(union > bpf_attr *attr) > u64 queue_size, cost; > > size = attr->max_entries + 1; > + if (size < attr->max_entries) > + return ERR_PTR(-EINVAL); > value_size = attr->value_size; Your tabs got eaten by your email client and they all disappeared, making the patch impossible to apply :( Care to try again? thanks, greg k-h
Re: [PATCH net-next 2/2] net: bridge: add no_linklocal_learn bool option
> int br_boolopt_get(const struct net_bridge *br, enum br_boolopt_id opt) > { > - int optval = 0; > - > switch (opt) { > + case BR_BOOLOPT_NO_LL_LEARN: > + return br_opt_get(br, BROPT_NO_LL_LEARN); > default: > break; > } > > - return optval; > + return 0; > } It seems like 1/2 of that change belongs in the previous patch. > --- a/net/bridge/br_sysfs_br.c > +++ b/net/bridge/br_sysfs_br.c > @@ -328,6 +328,27 @@ static ssize_t flush_store(struct device *d, > } > static DEVICE_ATTR_WO(flush); > > +static ssize_t no_linklocal_learn_show(struct device *d, > +struct device_attribute *attr, > +char *buf) > +{ > + struct net_bridge *br = to_bridge(d); > + return sprintf(buf, "%d\n", br_boolopt_get(br, BR_BOOLOPT_NO_LL_LEARN)); > +} > + > +static int set_no_linklocal_learn(struct net_bridge *br, unsigned long val) > +{ > + return br_boolopt_toggle(br, BR_BOOLOPT_NO_LL_LEARN, !!val); > +} > + > +static ssize_t no_linklocal_learn_store(struct device *d, > + struct device_attribute *attr, > + const char *buf, size_t len) > +{ > + return store_bridge_parm(d, buf, len, set_no_linklocal_learn); > +} > +static DEVICE_ATTR_RW(no_linklocal_learn); I thought we were trying to move away from sysfs? Do we need to add new options here? It seems like forcing people to use iproute2 for newer options is a good way to get people to convert to iproute2. Andrew
[PATCH] bpf: fix check of allowed specifiers in bpf_trace_printk
A format string consisting of "%p" or "%s" followed by an invalid specifier (e.g. "%p%\n" or "%s%") could pass the check which would make format_decode (lib/vsprintf.c) to warn. Reported-by: syzbot+1ec5c5ec949c4adaa...@syzkaller.appspotmail.com Signed-off-by: Martynas Pumputis --- kernel/trace/bpf_trace.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index 08fcfe440c63..9ab05736e1a1 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -225,6 +225,8 @@ BPF_CALL_5(bpf_trace_printk, char *, fmt, u32, fmt_size, u64, arg1, (void *) (long) unsafe_addr, sizeof(buf)); } + if (fmt[i] == '%') + i--; continue; } -- 2.19.1
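The escape described above — "%p" or "%s" followed by a bare '%' — happens because the validation loop skips one extra character after those specifiers, and the fix steps back so that a following '%' is re-examined. A simplified user-space model of the before/after behavior (this re-implements only the shape of the check, with an invented allowed-specifier set, not the kernel's actual code):

```c
#include <string.h>

/* fixed == 0 models the buggy loop; fixed == 1 models the patch's i--.
 * Returns 1 if fmt passes validation, 0 if rejected. */
static int fmt_ok(const char *fmt, int fixed)
{
    size_t i, len = strlen(fmt);

    for (i = 0; i < len; i++) {
        if (fmt[i] != '%')
            continue;
        if (i + 1 >= len)
            return 0;                  /* trailing bare '%' */
        i++;
        if (fmt[i] == 'p' || fmt[i] == 's') {
            i++;                       /* skip the char after %p / %s */
            if (fixed && i < len && fmt[i] == '%')
                i--;                   /* the fix: re-validate that '%' */
            continue;
        }
        if (fmt[i] != 'd' && fmt[i] != 'u' && fmt[i] != 'x' &&
            fmt[i] != '%')
            return 0;                  /* unsupported specifier */
    }
    return 1;
}
```

Without the step-back, "%p%" sails through validation and the invalid specifier is only noticed later by format_decode(), which warns.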
[PATCH bpf] bpf: Fix integer overflow in queue_stack_map_alloc.
Integer overflow in queue_stack_map_alloc when calculating size may lead to heap overflow of arbitrary length. The patch fix it by checking whether attr->max_entries+1 < attr->max_entries and bailing out if it is the case. The vulnerability is discovered with the assistance of syzkaller. Reported-by: Wei Wu To: Alexei Starovoitov Cc: Daniel Borkmann Cc: netdev Cc: Eric Dumazet Cc: Greg KH Signed-off-by: Wei Wu --- kernel/bpf/queue_stack_maps.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/kernel/bpf/queue_stack_maps.c b/kernel/bpf/queue_stack_maps.c index 8bbd72d3a121..c35a8a4721c8 100644 --- a/kernel/bpf/queue_stack_maps.c +++ b/kernel/bpf/queue_stack_maps.c @@ -67,6 +67,8 @@ static struct bpf_map *queue_stack_map_alloc(union bpf_attr *attr) u64 queue_size, cost; size = attr->max_entries + 1; + if (size < attr->max_entries) + return ERR_PTR(-EINVAL); value_size = attr->value_size; queue_size = sizeof(*qs) + (u64) value_size * size; -- 2.17.1
[PATCH net-next v2 0/5] Ease to interpret net-nsid
The goal of this series is to ease the interpretation of nsid received in netlink messages from other netns (when the user uses NETLINK_F_LISTEN_ALL_NSID). After this series, with a patched iproute2: $ ip netns add foo $ ip netns add bar $ touch /var/run/netns/init_net $ mount --bind /proc/1/ns/net /var/run/netns/init_net $ ip netns set init_net 11 $ ip netns set foo 12 $ ip netns set bar 13 $ ip netns init_net (id: 11) bar (id: 13) foo (id: 12) $ ip -n foo netns set init_net 21 $ ip -n foo netns set foo 22 $ ip -n foo netns set bar 23 $ ip -n foo netns init_net (id: 21) bar (id: 23) foo (id: 22) $ ip -n bar netns set init_net 31 $ ip -n bar netns set foo 32 $ ip -n bar netns set bar 33 $ ip -n bar netns init_net (id: 31) bar (id: 33) foo (id: 32) $ ip netns list-id target-nsid 12 nsid 21 current-nsid 11 (iproute2 netns name: init_net) nsid 22 current-nsid 12 (iproute2 netns name: foo) nsid 23 current-nsid 13 (iproute2 netns name: bar) $ ip -n bar netns list-id target-nsid 32 nsid 31 nsid 21 current-nsid 31 (iproute2 netns name: init_net) v1 -> v2: - patch 1/5: remove net from struct rtnl_net_dump_cb - patch 2/5: new in this version - patch 3/5: use a bool to know if rtnl_get_net_ns_capable() was called - patch 5/5: use struct net_fill_args include/uapi/linux/net_namespace.h | 2 + net/core/net_namespace.c | 157 +++-- 2 files changed, 133 insertions(+), 26 deletions(-) Comments are welcomed, Regards, Nicolas
[PATCH net-next v2 3/5] netns: add support of NETNSA_TARGET_NSID
Like it was done for link and address, add the ability to perform get/dump in another netns by specifying a target nsid attribute. Signed-off-by: Nicolas Dichtel --- include/uapi/linux/net_namespace.h | 1 + net/core/net_namespace.c | 86 ++ 2 files changed, 76 insertions(+), 11 deletions(-) diff --git a/include/uapi/linux/net_namespace.h b/include/uapi/linux/net_namespace.h index 0187c74d8889..0ed9dd61d32a 100644 --- a/include/uapi/linux/net_namespace.h +++ b/include/uapi/linux/net_namespace.h @@ -16,6 +16,7 @@ enum { NETNSA_NSID, NETNSA_PID, NETNSA_FD, + NETNSA_TARGET_NSID, __NETNSA_MAX, }; diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c index f8a5966b086c..885c54197e31 100644 --- a/net/core/net_namespace.c +++ b/net/core/net_namespace.c @@ -669,6 +669,7 @@ static const struct nla_policy rtnl_net_policy[NETNSA_MAX + 1] = { [NETNSA_NSID] = { .type = NLA_S32 }, [NETNSA_PID]= { .type = NLA_U32 }, [NETNSA_FD] = { .type = NLA_U32 }, + [NETNSA_TARGET_NSID]= { .type = NLA_S32 }, }; static int rtnl_net_newid(struct sk_buff *skb, struct nlmsghdr *nlh, @@ -780,9 +781,10 @@ static int rtnl_net_getid(struct sk_buff *skb, struct nlmsghdr *nlh, .seq = nlh->nlmsg_seq, .cmd = RTM_NEWNSID, }; + struct net *peer, *target = net; + bool put_target = false; struct nlattr *nla; struct sk_buff *msg; - struct net *peer; int err; err = nlmsg_parse(nlh, sizeof(struct rtgenmsg), tb, NETNSA_MAX, @@ -806,13 +808,27 @@ static int rtnl_net_getid(struct sk_buff *skb, struct nlmsghdr *nlh, return PTR_ERR(peer); } + if (tb[NETNSA_TARGET_NSID]) { + int id = nla_get_s32(tb[NETNSA_TARGET_NSID]); + + target = rtnl_get_net_ns_capable(NETLINK_CB(skb).sk, id); + if (IS_ERR(target)) { + NL_SET_BAD_ATTR(extack, tb[NETNSA_TARGET_NSID]); + NL_SET_ERR_MSG(extack, + "Target netns reference is invalid"); + err = PTR_ERR(target); + goto out; + } + put_target = true; + } + msg = nlmsg_new(rtnl_net_get_size(), GFP_KERNEL); if (!msg) { err = -ENOMEM; goto out; } - fillargs.nsid = peernet2id(net, 
peer); + fillargs.nsid = peernet2id(target, peer); err = rtnl_net_fill(msg, ); if (err < 0) goto err_out; @@ -823,15 +839,19 @@ static int rtnl_net_getid(struct sk_buff *skb, struct nlmsghdr *nlh, err_out: nlmsg_free(msg); out: + if (put_target) + put_net(target); put_net(peer); return err; } struct rtnl_net_dump_cb { + struct net *tgt_net; struct sk_buff *skb; struct net_fill_args fillargs; int idx; int s_idx; + bool put_tgt_net; }; static int rtnl_net_dumpid_one(int id, void *peer, void *data) @@ -852,10 +872,50 @@ static int rtnl_net_dumpid_one(int id, void *peer, void *data) return 0; } +static int rtnl_valid_dump_net_req(const struct nlmsghdr *nlh, struct sock *sk, + struct rtnl_net_dump_cb *net_cb, + struct netlink_callback *cb) +{ + struct netlink_ext_ack *extack = cb->extack; + struct nlattr *tb[NETNSA_MAX + 1]; + int err, i; + + err = nlmsg_parse_strict(nlh, sizeof(struct rtgenmsg), tb, NETNSA_MAX, +rtnl_net_policy, extack); + if (err < 0) + return err; + + for (i = 0; i <= NETNSA_MAX; i++) { + if (!tb[i]) + continue; + + if (i == NETNSA_TARGET_NSID) { + struct net *net; + + net = rtnl_get_net_ns_capable(sk, nla_get_s32(tb[i])); + if (IS_ERR(net)) { + NL_SET_BAD_ATTR(extack, tb[i]); + NL_SET_ERR_MSG(extack, + "Invalid target network namespace id"); + return PTR_ERR(net); + } + net_cb->tgt_net = net; + net_cb->put_tgt_net = true; + } else { + NL_SET_BAD_ATTR(extack, tb[i]); + NL_SET_ERR_MSG(extack, + "Unsupported attribute in dump request"); + return -EINVAL; + } + } + + return 0; +} + static int rtnl_net_dumpid(struct sk_buff *skb, struct netlink_callback *cb) { - struct net *net = sock_net(skb->sk); struct rtnl_net_dump_cb net_cb = { +
[PATCH net-next v2 1/5] netns: remove net arg from rtnl_net_fill()
This argument is not used anymore. Fixes: cab3c8ec8d57 ("netns: always provide the id to rtnl_net_fill()") Signed-off-by: Nicolas Dichtel --- net/core/net_namespace.c | 10 -- 1 file changed, 4 insertions(+), 6 deletions(-) diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c index fefe72774aeb..52b9620e3457 100644 --- a/net/core/net_namespace.c +++ b/net/core/net_namespace.c @@ -739,7 +739,7 @@ static int rtnl_net_get_size(void) } static int rtnl_net_fill(struct sk_buff *skb, u32 portid, u32 seq, int flags, -int cmd, struct net *net, int nsid) +int cmd, int nsid) { struct nlmsghdr *nlh; struct rtgenmsg *rth; @@ -801,7 +801,7 @@ static int rtnl_net_getid(struct sk_buff *skb, struct nlmsghdr *nlh, id = peernet2id(net, peer); err = rtnl_net_fill(msg, NETLINK_CB(skb).portid, nlh->nlmsg_seq, 0, - RTM_NEWNSID, net, id); + RTM_NEWNSID, id); if (err < 0) goto err_out; @@ -816,7 +816,6 @@ static int rtnl_net_getid(struct sk_buff *skb, struct nlmsghdr *nlh, } struct rtnl_net_dump_cb { - struct net *net; struct sk_buff *skb; struct netlink_callback *cb; int idx; @@ -833,7 +832,7 @@ static int rtnl_net_dumpid_one(int id, void *peer, void *data) ret = rtnl_net_fill(net_cb->skb, NETLINK_CB(net_cb->cb->skb).portid, net_cb->cb->nlh->nlmsg_seq, NLM_F_MULTI, - RTM_NEWNSID, net_cb->net, id); + RTM_NEWNSID, id); if (ret < 0) return ret; @@ -846,7 +845,6 @@ static int rtnl_net_dumpid(struct sk_buff *skb, struct netlink_callback *cb) { struct net *net = sock_net(skb->sk); struct rtnl_net_dump_cb net_cb = { - .net = net, .skb = skb, .cb = cb, .idx = 0, @@ -876,7 +874,7 @@ static void rtnl_net_notifyid(struct net *net, int cmd, int id) if (!msg) goto out; - err = rtnl_net_fill(msg, 0, 0, 0, cmd, net, id); + err = rtnl_net_fill(msg, 0, 0, 0, cmd, id); if (err < 0) goto err_out; -- 2.18.0
[PATCH net-next v2 2/5] netns: introduce 'struct net_fill_args'
This is preparatory work. To avoid passing too many arguments to the
function rtnl_net_fill(), a new structure is defined.

Signed-off-by: Nicolas Dichtel
---
 net/core/net_namespace.c | 48 ++++++++++++++++++++++++++--------------
 1 file changed, 34 insertions(+), 14 deletions(-)

diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 52b9620e3457..f8a5966b086c 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -738,20 +738,28 @@ static int rtnl_net_get_size(void)
 	       ;
 }

-static int rtnl_net_fill(struct sk_buff *skb, u32 portid, u32 seq, int flags,
-			 int cmd, int nsid)
+struct net_fill_args {
+	u32 portid;
+	u32 seq;
+	int flags;
+	int cmd;
+	int nsid;
+};
+
+static int rtnl_net_fill(struct sk_buff *skb, struct net_fill_args *args)
 {
 	struct nlmsghdr *nlh;
 	struct rtgenmsg *rth;

-	nlh = nlmsg_put(skb, portid, seq, cmd, sizeof(*rth), flags);
+	nlh = nlmsg_put(skb, args->portid, args->seq, args->cmd, sizeof(*rth),
+			args->flags);
 	if (!nlh)
 		return -EMSGSIZE;

 	rth = nlmsg_data(nlh);
 	rth->rtgen_family = AF_UNSPEC;

-	if (nla_put_s32(skb, NETNSA_NSID, nsid))
+	if (nla_put_s32(skb, NETNSA_NSID, args->nsid))
 		goto nla_put_failure;

 	nlmsg_end(skb, nlh);
@@ -767,10 +775,15 @@ static int rtnl_net_getid(struct sk_buff *skb, struct nlmsghdr *nlh,
 {
 	struct net *net = sock_net(skb->sk);
 	struct nlattr *tb[NETNSA_MAX + 1];
+	struct net_fill_args fillargs = {
+		.portid = NETLINK_CB(skb).portid,
+		.seq = nlh->nlmsg_seq,
+		.cmd = RTM_NEWNSID,
+	};
 	struct nlattr *nla;
 	struct sk_buff *msg;
 	struct net *peer;
-	int err, id;
+	int err;

 	err = nlmsg_parse(nlh, sizeof(struct rtgenmsg), tb, NETNSA_MAX,
 			  rtnl_net_policy, extack);
@@ -799,9 +812,8 @@ static int rtnl_net_getid(struct sk_buff *skb, struct nlmsghdr *nlh,
 		goto out;
 	}

-	id = peernet2id(net, peer);
-	err = rtnl_net_fill(msg, NETLINK_CB(skb).portid, nlh->nlmsg_seq, 0,
-			    RTM_NEWNSID, id);
+	fillargs.nsid = peernet2id(net, peer);
+	err = rtnl_net_fill(msg, &fillargs);
 	if (err < 0)
 		goto err_out;

@@ -817,7 +829,7 @@ static int rtnl_net_getid(struct sk_buff *skb, struct nlmsghdr *nlh,

 struct rtnl_net_dump_cb {
 	struct sk_buff *skb;
-	struct netlink_callback *cb;
+	struct net_fill_args fillargs;
 	int idx;
 	int s_idx;
 };
@@ -830,9 +842,8 @@ static int rtnl_net_dumpid_one(int id, void *peer, void *data)
 	if (net_cb->idx < net_cb->s_idx)
 		goto cont;

-	ret = rtnl_net_fill(net_cb->skb, NETLINK_CB(net_cb->cb->skb).portid,
-			    net_cb->cb->nlh->nlmsg_seq, NLM_F_MULTI,
-			    RTM_NEWNSID, id);
+	net_cb->fillargs.nsid = id;
+	ret = rtnl_net_fill(net_cb->skb, &net_cb->fillargs);
 	if (ret < 0)
 		return ret;

@@ -846,7 +857,12 @@ static int rtnl_net_dumpid(struct sk_buff *skb, struct netlink_callback *cb)
 	struct net *net = sock_net(skb->sk);
 	struct rtnl_net_dump_cb net_cb = {
 		.skb = skb,
-		.cb = cb,
+		.fillargs = {
+			.portid = NETLINK_CB(cb->skb).portid,
+			.seq = cb->nlh->nlmsg_seq,
+			.flags = NLM_F_MULTI,
+			.cmd = RTM_NEWNSID,
+		},
 		.idx = 0,
 		.s_idx = cb->args[0],
 	};
@@ -867,6 +883,10 @@ static int rtnl_net_dumpid(struct sk_buff *skb, struct netlink_callback *cb)

 static void rtnl_net_notifyid(struct net *net, int cmd, int id)
 {
+	struct net_fill_args fillargs = {
+		.cmd = cmd,
+		.nsid = id,
+	};
 	struct sk_buff *msg;
 	int err = -ENOMEM;

@@ -874,7 +894,7 @@ static void rtnl_net_notifyid(struct net *net, int cmd, int id)
 	if (!msg)
 		goto out;

-	err = rtnl_net_fill(msg, 0, 0, 0, cmd, id);
+	err = rtnl_net_fill(msg, &fillargs);
 	if (err < 0)
 		goto err_out;
--
2.18.0
[PATCH net-next v2 4/5] netns: enable to specify a nsid for a get request
Combined with NETNSA_TARGET_NSID, this enables "translating" a nsid from
one netns into the corresponding nsid in another netns. This is useful
when using NETLINK_F_LISTEN_ALL_NSID, because it helps the user
interpret a nsid received from another netns.

Signed-off-by: Nicolas Dichtel
---
 net/core/net_namespace.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 885c54197e31..dd25fb22ad45 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -797,6 +797,11 @@ static int rtnl_net_getid(struct sk_buff *skb, struct nlmsghdr *nlh,
 	} else if (tb[NETNSA_FD]) {
 		peer = get_net_ns_by_fd(nla_get_u32(tb[NETNSA_FD]));
 		nla = tb[NETNSA_FD];
+	} else if (tb[NETNSA_NSID]) {
+		peer = get_net_ns_by_id(net, nla_get_u32(tb[NETNSA_NSID]));
+		if (!peer)
+			peer = ERR_PTR(-ENOENT);
+		nla = tb[NETNSA_NSID];
 	} else {
 		NL_SET_ERR_MSG(extack, "Peer netns reference is missing");
 		return -EINVAL;
 	}
--
2.18.0
[PATCH net-next v2 5/5] netns: enable to dump full nsid translation table
Like the previous patch, the goal is to ease to convert nsids from one netns to another netns. A new attribute (NETNSA_CURRENT_NSID) is added to the kernel answer when NETNSA_TARGET_NSID is provided, thus the user can easily convert nsids. Signed-off-by: Nicolas Dichtel --- include/uapi/linux/net_namespace.h | 1 + net/core/net_namespace.c | 30 -- 2 files changed, 25 insertions(+), 6 deletions(-) diff --git a/include/uapi/linux/net_namespace.h b/include/uapi/linux/net_namespace.h index 0ed9dd61d32a..9f9956809565 100644 --- a/include/uapi/linux/net_namespace.h +++ b/include/uapi/linux/net_namespace.h @@ -17,6 +17,7 @@ enum { NETNSA_PID, NETNSA_FD, NETNSA_TARGET_NSID, + NETNSA_CURRENT_NSID, __NETNSA_MAX, }; diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c index dd25fb22ad45..25030e0317a2 100644 --- a/net/core/net_namespace.c +++ b/net/core/net_namespace.c @@ -745,6 +745,8 @@ struct net_fill_args { int flags; int cmd; int nsid; + bool add_ref; + int ref_nsid; }; static int rtnl_net_fill(struct sk_buff *skb, struct net_fill_args *args) @@ -763,6 +765,10 @@ static int rtnl_net_fill(struct sk_buff *skb, struct net_fill_args *args) if (nla_put_s32(skb, NETNSA_NSID, args->nsid)) goto nla_put_failure; + if (args->add_ref && + nla_put_s32(skb, NETNSA_CURRENT_NSID, args->ref_nsid)) + goto nla_put_failure; + nlmsg_end(skb, nlh); return 0; @@ -782,7 +788,6 @@ static int rtnl_net_getid(struct sk_buff *skb, struct nlmsghdr *nlh, .cmd = RTM_NEWNSID, }; struct net *peer, *target = net; - bool put_target = false; struct nlattr *nla; struct sk_buff *msg; int err; @@ -824,7 +829,8 @@ static int rtnl_net_getid(struct sk_buff *skb, struct nlmsghdr *nlh, err = PTR_ERR(target); goto out; } - put_target = true; + fillargs.add_ref = true; + fillargs.ref_nsid = peernet2id(net, peer); } msg = nlmsg_new(rtnl_net_get_size(), GFP_KERNEL); @@ -844,7 +850,7 @@ static int rtnl_net_getid(struct sk_buff *skb, struct nlmsghdr *nlh, err_out: nlmsg_free(msg); out: - if (put_target) + if 
(fillargs.add_ref) put_net(target); put_net(peer); return err; @@ -852,11 +858,11 @@ static int rtnl_net_getid(struct sk_buff *skb, struct nlmsghdr *nlh, struct rtnl_net_dump_cb { struct net *tgt_net; + struct net *ref_net; struct sk_buff *skb; struct net_fill_args fillargs; int idx; int s_idx; - bool put_tgt_net; }; static int rtnl_net_dumpid_one(int id, void *peer, void *data) @@ -868,6 +874,8 @@ static int rtnl_net_dumpid_one(int id, void *peer, void *data) goto cont; net_cb->fillargs.nsid = id; + if (net_cb->fillargs.add_ref) + net_cb->fillargs.ref_nsid = __peernet2id(net_cb->ref_net, peer); ret = rtnl_net_fill(net_cb->skb, _cb->fillargs); if (ret < 0) return ret; @@ -904,8 +912,9 @@ static int rtnl_valid_dump_net_req(const struct nlmsghdr *nlh, struct sock *sk, "Invalid target network namespace id"); return PTR_ERR(net); } + net_cb->fillargs.add_ref = true; + net_cb->ref_net = net_cb->tgt_net; net_cb->tgt_net = net; - net_cb->put_tgt_net = true; } else { NL_SET_BAD_ATTR(extack, tb[i]); NL_SET_ERR_MSG(extack, @@ -940,12 +949,21 @@ static int rtnl_net_dumpid(struct sk_buff *skb, struct netlink_callback *cb) } spin_lock_bh(_cb.tgt_net->nsid_lock); + if (net_cb.fillargs.add_ref && + !net_eq(net_cb.ref_net, net_cb.tgt_net) && + !spin_trylock_bh(_cb.ref_net->nsid_lock)) { + err = -EAGAIN; + goto end; + } idr_for_each(_cb.tgt_net->netns_ids, rtnl_net_dumpid_one, _cb); + if (net_cb.fillargs.add_ref && + !net_eq(net_cb.ref_net, net_cb.tgt_net)) + spin_unlock_bh(_cb.ref_net->nsid_lock); spin_unlock_bh(_cb.tgt_net->nsid_lock); cb->args[0] = net_cb.idx; end: - if (net_cb.put_tgt_net) + if (net_cb.fillargs.add_ref) put_net(net_cb.tgt_net); return err < 0 ? err : skb->len; } -- 2.18.0
Re: DSA support for Marvell 88e6065 switch
On Thu, Nov 22, 2018 at 02:21:23PM +0100, Pavel Machek wrote: > > > If I wanted it to work, what do I need to do? AFAICT phy autoprobing > > > should just attach it as soon as it is compiled in? > > > > Nope. It is a switch, not a PHY. Switches are never auto-probed > > because they are not guaranteed to have ID registers. > > > > You need to use the legacy device tree binding. Look in > > Documentation/devicetree/bindings/net/dsa/dsa.txt, section Deprecated > > Binding. You can get more examples if you checkout old kernels. Or > > kirkwood-rd88f6281.dtsi, the dsa { } node which is disabled. > > Thanks; I ported code from mv88e66xx in the meantime, and switch > appears to be detected. > > But I'm running into problems with tagging code, and I guess I'd like > some help understanding. > > tag_trailer: allocates new skb, then copies data around. > > tag_qca: does dev->stats.tx_packets++, and reuses existing skb. > > tag_brcm: reuses existing skb. > > Is qca wrong in adjusting the statistics? Why does trailer allocate > new skb? > > 6065 seems to use 2-byte header between "SFD" and "Destination > address" in the ethernet frame. That's ... strange place to put > header, as addresses are now shifted. I need to put ethernet in > promisc mode (by running tcpdump) to get data moving.. and can not > figure out what to do in tag_... Does this switch chip not also support trailer mode? There's basically four tagging modes for Marvell switch chips: header mode (the one you described), trailer mode (tag_trailer.c), DSA and ethertype DSA. The switch chips I worked on that didn't support (ethertype) DSA tagging did support both header and trailer modes, and I chose to run them in trailer mode for the reasons you describe above, but if your chip doesn't support trailer mode, then yes, you'll have to add support for header mode and put the underlying interface into promiscuous mode and such.
[PATCH net] be2net: Fix NULL pointer dereference in be_tx_timeout()
The driver enumerates Tx queues in ndo_tx_timeout() handler, here is possible race with be_update_queues. For this case we set carrier_off. It prevents netdev watchdog to be fired after be_clear_queues(). The watchdog timeout doesn't make any sense here as we re-creating queues. Reproducer: We can reproduce bug with ethtool when changing queue count ethtool -L $netif combined 1 ethtool -L $netif combined 32 If oops is not triggered imediately, just run it again or in loop. Oops: [ 865.768648] NETDEV WATCHDOG: enp4s0f0 (be2net): transmit queue 0 timed out [ 865.775539] WARNING: CPU: 3 PID: 0 at net/sched/sch_generic.c:461 dev_watchdog+0x20d/0x220 [ 865.783796] Modules linked in: be2net intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul iTCO_wdt iTCO_vendor_support ghash_clmulni_intel mei_me intel_cstate intel_uncore ipmi_ssif mei ipmi_si pcspkr sg i2c_i801 joydev lpc_ich intel_rapl_perf ipmi_devintf ioatdma ipmi_msghandler xfs libcrc32c sd_mod mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ahci libahci crc32c_intel drm serio_raw libata igb dca i2c_algo_bit wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: be2net] [ 865.834289] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.20.0-rc3+ #2 [ 865.840640] Hardware name: Supermicro X9DBU/X9DBU, BIOS 3.2 01/15/2015 [ 865.847168] RIP: 0010:dev_watchdog+0x20d/0x220 [ 865.851612] Code: 00 49 63 4e e0 eb 92 4c 89 e7 c6 05 a5 de c9 00 01 e8 f7 b2 fc ff 89 d9 4c 89 e6 48 c7 c7 a0 d1 b2 99 48 89 c2 e8 7d b0 98 ff <0f> 0b eb c0 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 [ 865.870358] RSP: 0018:9bee73ac3e88 EFLAGS: 00010282 [ 865.875583] RAX: RBX: RCX: 083f [ 865.882707] RDX: RSI: 00f6 RDI: 003f [ 865.889832] RBP: 9bee5fa0045c R08: 0824 R09: 0007 [ 865.896956] R10: R11: 9a3f162d R12: 9bee5fa0 [ 865.904088] R13: 0003 R14: 9bee5fa00480 R15: 0020 [ 865.911214] FS: () GS:9bee73ac() knlGS: [ 865.919298] CS: 0010 DS: ES: CR0: 
80050033 [ 865.925037] CR2: 5580497ce040 CR3: 0002cf60a004 CR4: 000606e0 [ 865.932170] Call Trace: [ 865.934626] [ 865.936645] ? pfifo_fast_dequeue+0x160/0x160 [ 865.941005] call_timer_fn+0x2b/0x130 [ 865.944670] run_timer_softirq+0x3b9/0x3f0 [ 865.948768] ? tick_sched_timer+0x37/0x70 [ 865.952779] ? __hrtimer_run_queues+0x110/0x280 [ 865.957314] __do_softirq+0xdd/0x2fe [ 865.960896] irq_exit+0xfa/0x100 [ 865.964125] smp_apic_timer_interrupt+0x74/0x140 [ 865.968745] apic_timer_interrupt+0xf/0x20 [ 865.972844] [ 865.974953] RIP: 0010:cpuidle_enter_state+0xb0/0x320 [ 865.979915] Code: 89 c3 66 66 66 66 90 31 ff e8 0c 07 a6 ff 80 7c 24 0b 00 74 12 9c 58 f6 c4 02 0f 85 46 02 00 00 31 ff e8 33 e0 ab ff fb 85 ed <0f> 88 1a 02 00 00 48 b8 ff ff ff ff f3 01 00 00 48 2b 1c 24 48 39 [ 865.998661] RSP: 0018:bc9ac19e7ea0 EFLAGS: 0206 ORIG_RAX: ff13 [ 866.006225] RAX: 9bee73ae1dc0 RBX: 00c9938e11ae RCX: 001f [ 866.013350] RDX: 00c9938e11ae RSI: 435e532a RDI: [ 866.020474] RBP: 0005 R08: 0002 R09: 00021640 [ 866.027598] R10: 9c434b946fde R11: 9bee73ae0e44 R12: 99d27538 [ 866.034723] R13: 9bee73aec628 R14: 0005 R15: [ 866.041860] do_idle+0x1f1/0x230 [ 866.045091] cpu_startup_entry+0x19/0x20 [ 866.049016] start_secondary+0x195/0x1e0 [ 866.052943] secondary_startup_64+0xb6/0xc0 [ 866.057129] ---[ end trace dead88c26bcd8261 ]--- [ 866.061750] be2net :04:00.0: TXQ Dump: 0 H: 0 T: 0 used: 0, qid: 0x2 [ 866.068452] BUG: unable to handle kernel NULL pointer dereference at [ 866.076273] PGD 0 P4D 0 [ 866.078810] Oops: [#1] SMP PTI [ 866.082305] CPU: 3 PID: 0 Comm: swapper/3 Tainted: GW 4.20.0-rc3+ #2 [ 866.090041] Hardware name: Supermicro X9DBU/X9DBU, BIOS 3.2 01/15/2015 [ 866.096566] RIP: 0010:be_tx_timeout+0x7c/0x300 [be2net] [ 866.101786] Code: 8b 45 1c 41 8b 4d 14 48 89 df 31 ed 45 8b 4d 18 48 c7 c6 80 51 2c c0 50 45 8b 45 10 8b 54 24 14 e8 09 a7 cb d8 4d 8b 7d 20 59 <41> 8b 0c af 45 8b 44 af 04 41 8b 74 af 0c 45 8b 4c af 08 89 ca 44 [ 866.120532] RSP: 0018:9bee73ac3e38 EFLAGS: 
00010246 [ 866.125758] RAX: RBX: 9bee72d6b0b0 RCX: 0002 [ 866.132882] RDX: RSI: 00f6 RDI: 003f [ 866.140014] RBP: R08: 084d R09: 0007 [ 866.147138] R10: R11: 9a3f162d R12: c02c60ab [ 866.154263] R13: 9bee5fa04b40 R14: c02c613a R15: [ 866.161388] FS:
[PATCH v3 0/4] Fix unsafe BPF_PROG_TEST_RUN interface
Right now, there is no safe way to use BPF_PROG_TEST_RUN with data_out. This is because bpf_test_finish copies the output buffer to user space without checking its size. This can lead to the kernel overwriting data in user space after the buffer if xdp_adjust_head and friends are in play. Changes in v3: * Introduce bpf_prog_test_run_xattr instead of modifying the existing function Changes in v2: * Make the syscall return ENOSPC if data_size_out is too small * Make bpf_prog_test_run return EINVAL if size_out is missing * Document the new behaviour of data_size_out Lorenz Bauer (4): bpf: respect size hint to BPF_PROG_TEST_RUN if present tools: sync uapi/linux/bpf.h libbpf: add bpf_prog_test_run_xattr selftests: add a test for bpf_prog_test_run_xattr include/uapi/linux/bpf.h | 7 +++- net/bpf/test_run.c | 15 +++- tools/include/uapi/linux/bpf.h | 7 +++- tools/lib/bpf/bpf.c | 27 + tools/lib/bpf/bpf.h | 13 +++ tools/testing/selftests/bpf/test_progs.c | 49 6 files changed, 112 insertions(+), 6 deletions(-) -- 2.17.1
[PATCH v3 4/4] selftests: add a test for bpf_prog_test_run_xattr
Make sure that bpf_prog_test_run_xattr returns the correct length and that the kernel respects the output size hint. Also check that errno indicates ENOSPC if there is a short output buffer given. Signed-off-by: Lorenz Bauer --- tools/testing/selftests/bpf/test_progs.c | 49 1 file changed, 49 insertions(+) diff --git a/tools/testing/selftests/bpf/test_progs.c b/tools/testing/selftests/bpf/test_progs.c index c1e688f61061..f9f5b1dbcc83 100644 --- a/tools/testing/selftests/bpf/test_progs.c +++ b/tools/testing/selftests/bpf/test_progs.c @@ -124,6 +124,54 @@ static void test_pkt_access(void) bpf_object__close(obj); } +static void test_prog_run_xattr(void) +{ + const char *file = "./test_pkt_access.o"; + __u32 duration, retval, size_out; + struct bpf_object *obj; + char buf[10]; + int err; + struct bpf_prog_test_run_attr tattr = { + .repeat = 1, + .data = _v4, + .size = sizeof(pkt_v4), + .data_out = buf, + .size_out = 5, + }; + + err = bpf_prog_load(file, BPF_PROG_TYPE_SCHED_CLS, , + _fd); + if (CHECK(err, "load", "err %d errno %d\n", err, errno)) + return; + + memset(buf, 0, sizeof(buf)); + + err = bpf_prog_test_run_xattr(, _out, , ); + CHECK(err != -1 || errno != ENOSPC || retval, "run", + "err %d errno %d retval %d\n", err, errno, retval); + + CHECK(size_out != sizeof(pkt_v4), "output_size", + "incorrect output size, want %lu have %u\n", + sizeof(pkt_v4), size_out); + + CHECK(buf[5] != 0, "overflow", + "BPF_PROG_TEST_RUN ignored size hint\n"); + + tattr.data_out = NULL; + tattr.size_out = 0; + errno = 0; + + err = bpf_prog_test_run_xattr(, NULL, , ); + CHECK(err || errno || retval, "run_no_output", + "err %d errno %d retval %d\n", err, errno, retval); + + tattr.size_out = 1; + err = bpf_prog_test_run_xattr(, NULL, NULL, ); + CHECK(err != -EINVAL, "run_wrong_size_out", "err %d\n", err); + + bpf_object__close(obj); +} + static void test_xdp(void) { struct vip key4 = {.protocol = 6, .family = AF_INET}; @@ -1837,6 +1885,7 @@ int main(void) jit_enabled = is_jit_enabled(); 
test_pkt_access(); + test_prog_run_xattr(); test_xdp(); test_xdp_adjust_tail(); test_l4lb_all(); -- 2.17.1