Re: [next-queue PATCH v5 3/5] net/sched: Introduce Credit Based Shaper (CBS) qdisc
Wed, Oct 11, 2017 at 02:43:58AM CEST, vinicius.go...@intel.com wrote:
>This queueing discipline implements the shaper algorithm defined by
>the 802.1Q-2014 Section 8.6.8.2 and detailed in Annex L.
>
>Its primary use is to apply some bandwidth reservation to user
>defined traffic classes, which are mapped to different queues via the
>mqprio qdisc.
>
>Only a simple software implementation is added for now.
>
>Signed-off-by: Vinicius Costa Gomes
>Signed-off-by: Jesus Sanchez-Palencia
>---
> include/linux/netdevice.h      |   1 +
> include/net/pkt_sched.h        |   9 ++
> include/uapi/linux/pkt_sched.h |  18 +++
> net/sched/Kconfig              |  11 ++
> net/sched/Makefile             |   1 +
> net/sched/sch_cbs.c            | 305 +
> 6 files changed, 345 insertions(+)
> create mode 100644 net/sched/sch_cbs.c
>
>diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>index 31bb3010c69b..1f6c44ef5b21 100644
>--- a/include/linux/netdevice.h
>+++ b/include/linux/netdevice.h
>@@ -775,6 +775,7 @@ enum tc_setup_type {
> 	TC_SETUP_CLSFLOWER,
> 	TC_SETUP_CLSMATCHALL,
> 	TC_SETUP_CLSBPF,
>+	TC_SETUP_CBS,
> };
> 
> /* These structures hold the attributes of xdp state that are being passed
>diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
>index 259bc191ba59..7c597b050b36 100644
>--- a/include/net/pkt_sched.h
>+++ b/include/net/pkt_sched.h
>@@ -146,4 +146,13 @@ static inline bool is_classid_clsact_egress(u32 classid)
> 	       TC_H_MIN(classid) == TC_H_MIN(TC_H_MIN_EGRESS);
> }
> 
>+struct tc_cbs_qopt_offload {
>+	u8 enable;
>+	s32 queue;
>+	s32 hicredit;
>+	s32 locredit;
>+	s32 idleslope;
>+	s32 sendslope;

Please introduce the qdisc in one patch, then offload it in a second.
That is what I requested already. 2 patches please.

[...]

>+static struct Qdisc_ops cbs_qdisc_ops __read_mostly = {
>+	.next		= NULL,

It is already 0, no need to re-init.

>+	.id		= "cbs",
>+	.priv_size	= sizeof(struct cbs_sched_data),
>+	.enqueue	= cbs_enqueue,
>+	.dequeue	= cbs_dequeue,
>+	.peek		= qdisc_peek_dequeued,
>+	.init		= cbs_init,
>+	.reset		= qdisc_reset_queue,
>+	.destroy	= cbs_destroy,
>+	.change		= cbs_change,
>+	.dump		= cbs_dump,
>+	.owner		= THIS_MODULE,
>+};
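For context on what the hicredit/locredit/idleslope/sendslope knobs in the offload struct control, here is a rough userspace model of the credit accounting that CBS performs. This is an illustrative sketch only, not the kernel's sch_cbs.c code; the names and the abstract time/credit units are assumptions (the real qdisc works in bytes and nanoseconds).

```c
#include <stdint.h>

/* All values in arbitrary, consistent units (credits and
 * credits-per-time-unit); the real qdisc uses bytes and nanoseconds. */
struct cbs_state {
    int64_t credits;    /* current credit level */
    int64_t idleslope;  /* credit gained per time unit while waiting */
    int64_t sendslope;  /* credit change per time unit while sending (negative) */
    int64_t hicredit;   /* upper clamp */
    int64_t locredit;   /* lower clamp */
};

static int64_t clamp64(int64_t v, int64_t lo, int64_t hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Accumulate credits while waiting for 'delta' time units. */
void cbs_wait(struct cbs_state *s, int64_t delta)
{
    s->credits = clamp64(s->credits + s->idleslope * delta,
                         s->locredit, s->hicredit);
}

/* Spend credits while transmitting for 'delta' time units. */
void cbs_send(struct cbs_state *s, int64_t delta)
{
    s->credits = clamp64(s->credits + s->sendslope * delta,
                         s->locredit, s->hicredit);
}

/* A queued frame may start transmission only with non-negative credit. */
int cbs_may_send(const struct cbs_state *s)
{
    return s->credits >= 0;
}
```

The net effect is the bandwidth reservation described in the commit message: a class that just transmitted goes credit-negative and must idle until idleslope brings it back to zero.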
Re: [PATCH] wcn36xx: Remove unnecessary rcu_read_unlock in wcn36xx_bss_info_changed
On Sun 08 Oct 06:06 PDT 2017, Jia-Ju Bai wrote:

> No rcu_read_lock is called, but rcu_read_unlock is still called.
> Thus rcu_read_unlock should be removed.

Thanks, not sure how I could miss that one.

Kalle, can you please include this in a v4.14-rc pull request?

Fixes: 39efc7cc7ccf ("wcn36xx: Introduce mutual exclusion of fw configuration")

> Signed-off-by: Jia-Ju Bai

Acked-by: Bjorn Andersson

Regards,
Bjorn

> ---
>  drivers/net/wireless/ath/wcn36xx/main.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/drivers/net/wireless/ath/wcn36xx/main.c b/drivers/net/wireless/ath/wcn36xx/main.c
> index 35bd50b..b83f01d 100644
> --- a/drivers/net/wireless/ath/wcn36xx/main.c
> +++ b/drivers/net/wireless/ath/wcn36xx/main.c
> @@ -812,7 +812,6 @@ static void wcn36xx_bss_info_changed(struct ieee80211_hw *hw,
>  		if (!sta) {
>  			wcn36xx_err("sta %pM is not found\n",
>  				    bss_conf->bssid);
> -			rcu_read_unlock();
>  			goto out;
>  		}
>  		sta_priv = wcn36xx_sta_to_priv(sta);
> --
> 1.7.9.5
> 
> ___
> wcn36xx mailing list
> wcn3...@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/wcn36xx
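The bug pattern here is an unlock on an error path that never took the lock. A toy model of the balanced shape the patch restores, using a hypothetical counter as a stand-in for the real RCU primitives (this is not the RCU API or the driver code):

```c
/* Toy model of the imbalance this patch fixes: the error path called
 * rcu_read_unlock() without a matching rcu_read_lock(). The counter
 * below is a hypothetical stand-in for lock nesting depth. */
static int rcu_depth;

static void fake_rcu_read_lock(void)   { rcu_depth++; }
static void fake_rcu_read_unlock(void) { rcu_depth--; }

/* Mirrors the shape of wcn36xx_bss_info_changed(): the lookup can fail
 * before any read-side lock is taken, so the error path must not
 * unlock. Returns the lock depth after the call; 0 means balanced. */
int lookup_and_use(int sta_present)
{
    if (!sta_present)
        goto out;            /* the fixed path: no unlock here */

    fake_rcu_read_lock();
    /* ... use the protected station entry ... */
    fake_rcu_read_unlock();
out:
    return rcu_depth;
}
```

With the spurious unlock in place, the failure path would leave the depth at -1, which for real RCU means a read-side critical section was "closed" that was never opened.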
[PATCH v2 1/7] net: qrtr: Invoke sk_error_report() after setting sk_err
Rather than manually waking up any context sleeping on the sock to
signal an error, we should call sk_error_report(). This has the added
benefit that in-kernel consumers can override this notification with
their own callbacks.

Signed-off-by: Bjorn Andersson
---

Changes since v1:
- None

 net/qrtr/qrtr.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/qrtr/qrtr.c b/net/qrtr/qrtr.c
index c2f5c13550c0..7e4b49a8349e 100644
--- a/net/qrtr/qrtr.c
+++ b/net/qrtr/qrtr.c
@@ -541,7 +541,7 @@ static void qrtr_reset_ports(void)
 		sock_hold(&ipc->sk);
 		ipc->sk.sk_err = ENETRESET;
-		wake_up_interruptible(sk_sleep(&ipc->sk));
+		ipc->sk.sk_error_report(&ipc->sk);
 		sock_put(&ipc->sk);
 	}
 	mutex_unlock(&qrtr_port_lock);
-- 
2.12.0
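A toy illustration of why reporting through the socket's callback is preferable to waking the wait queue directly: a consumer can install its own report function and observe the error without sleeping on the queue at all. These are hypothetical types, not the real struct sock:

```c
#include <errno.h>

/* Hypothetical toy socket; not net/core's struct sock. */
struct toy_sock {
    int err;
    void (*error_report)(struct toy_sock *sk);
    int woken;      /* set by the default "wake the sleepers" behaviour */
    int notified;   /* set by an overriding in-kernel consumer */
};

static void default_report(struct toy_sock *sk)  { sk->woken = 1; }
static void consumer_report(struct toy_sock *sk) { sk->notified = sk->err; }

/* Mirrors the shape of qrtr_reset_ports(): set the error, then report
 * it through the socket's callback rather than waking the wait queue
 * by hand. */
void toy_reset_port(struct toy_sock *sk)
{
    sk->err = ENETRESET;
    sk->error_report(sk);
}
```

With the direct wake-up, only a sleeper would have noticed the error; through the callback, whichever notification the socket's owner installed runs.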
[PATCH v2 4/7] net: qrtr: Pass source and destination to enqueue functions
Defer writing the message header to the skb until its time to enqueue the packet. As the receive path is reworked to decode the message header as it's received from the transport and only pass around the payload in the skb this change means that we do not have to fill out the full message header just to decode it immediately in qrtr_local_enqueue(). In the future this change also makes it possible to prepend message headers based on the version of each link. Signed-off-by: Bjorn Andersson --- Changes since v1: - None net/qrtr/qrtr.c | 120 1 file changed, 69 insertions(+), 51 deletions(-) diff --git a/net/qrtr/qrtr.c b/net/qrtr/qrtr.c index d85ca7170b8f..82dc83789310 100644 --- a/net/qrtr/qrtr.c +++ b/net/qrtr/qrtr.c @@ -97,8 +97,12 @@ struct qrtr_node { struct list_head item; }; -static int qrtr_local_enqueue(struct qrtr_node *node, struct sk_buff *skb); -static int qrtr_bcast_enqueue(struct qrtr_node *node, struct sk_buff *skb); +static int qrtr_local_enqueue(struct qrtr_node *node, struct sk_buff *skb, + int type, struct sockaddr_qrtr *from, + struct sockaddr_qrtr *to); +static int qrtr_bcast_enqueue(struct qrtr_node *node, struct sk_buff *skb, + int type, struct sockaddr_qrtr *from, + struct sockaddr_qrtr *to); /* Release node resources and free the node. * @@ -136,10 +140,27 @@ static void qrtr_node_release(struct qrtr_node *node) } /* Pass an outgoing packet socket buffer to the endpoint driver. 
*/ -static int qrtr_node_enqueue(struct qrtr_node *node, struct sk_buff *skb) +static int qrtr_node_enqueue(struct qrtr_node *node, struct sk_buff *skb, +int type, struct sockaddr_qrtr *from, +struct sockaddr_qrtr *to) { + struct qrtr_hdr *hdr; + size_t len = skb->len; int rc = -ENODEV; + hdr = skb_push(skb, QRTR_HDR_SIZE); + hdr->version = cpu_to_le32(QRTR_PROTO_VER); + hdr->type = cpu_to_le32(type); + hdr->src_node_id = cpu_to_le32(from->sq_node); + hdr->src_port_id = cpu_to_le32(from->sq_port); + hdr->dst_node_id = cpu_to_le32(to->sq_node); + hdr->dst_port_id = cpu_to_le32(to->sq_port); + + hdr->size = cpu_to_le32(len); + hdr->confirm_rx = 0; + + skb_put_padto(skb, ALIGN(len, 4)); + mutex_lock(&node->ep_lock); if (node->ep) rc = node->ep->xmit(node->ep, skb); @@ -237,23 +258,13 @@ EXPORT_SYMBOL_GPL(qrtr_endpoint_post); static struct sk_buff *qrtr_alloc_ctrl_packet(u32 type, size_t pkt_len, u32 src_node, u32 dst_node) { - struct qrtr_hdr *hdr; struct sk_buff *skb; skb = alloc_skb(QRTR_HDR_SIZE + pkt_len, GFP_KERNEL); if (!skb) return NULL; - skb_reset_transport_header(skb); - hdr = skb_put(skb, QRTR_HDR_SIZE); - hdr->version = cpu_to_le32(QRTR_PROTO_VER); - hdr->type = cpu_to_le32(type); - hdr->src_node_id = cpu_to_le32(src_node); - hdr->src_port_id = cpu_to_le32(QRTR_PORT_CTRL); - hdr->confirm_rx = cpu_to_le32(0); - hdr->size = cpu_to_le32(pkt_len); - hdr->dst_node_id = cpu_to_le32(dst_node); - hdr->dst_port_id = cpu_to_le32(QRTR_PORT_CTRL); + skb_reserve(skb, QRTR_HDR_SIZE); return skb; } @@ -326,6 +337,8 @@ static void qrtr_port_put(struct qrtr_sock *ipc); static void qrtr_node_rx_work(struct work_struct *work) { struct qrtr_node *node = container_of(work, struct qrtr_node, work); + struct sockaddr_qrtr dst; + struct sockaddr_qrtr src; struct sk_buff *skb; while ((skb = skb_dequeue(&node->rx_queue)) != NULL) { @@ -341,6 +354,11 @@ static void qrtr_node_rx_work(struct work_struct *work) dst_port = le32_to_cpu(phdr->dst_port_id); confirm = !!phdr->confirm_rx; + 
src.sq_node = src_node; + src.sq_port = le32_to_cpu(phdr->src_port_id); + dst.sq_node = dst_node; + dst.sq_port = dst_port; + qrtr_node_assign(node, src_node); ipc = qrtr_port_lookup(dst_port); @@ -357,7 +375,9 @@ static void qrtr_node_rx_work(struct work_struct *work) skb = qrtr_alloc_resume_tx(dst_node, node->nid, dst_port); if (!skb) break; - if (qrtr_node_enqueue(node, skb)) + + if (qrtr_node_enqueue(node, skb, QRTR_TYPE_RESUME_TX, + &dst, &src)) break; } } @@ -407,6 +427,8 @@ EXPORT_SYMBOL_GPL(qrtr_endpoint_register); void qrtr_endpoint_unregister(struct qrtr_endpoint *ep) { struct qrtr_node *node = ep->node; + stru
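One detail worth noting in the new qrtr_node_enqueue() above: the payload is padded to a 4-byte boundary via skb_put_padto(skb, ALIGN(len, 4)). A minimal model of that rounding, with TOY_ALIGN mirroring the kernel's ALIGN() macro (the helper name is hypothetical):

```c
#include <stddef.h>

/* TOY_ALIGN mirrors the kernel's ALIGN(): round x up to a multiple of
 * the power-of-two a. */
#define TOY_ALIGN(x, a) (((x) + (a) - 1) & ~((size_t)(a) - 1))

/* On-wire payload length after the 4-byte padding. */
size_t qrtr_padded_len(size_t payload)
{
    return TOY_ALIGN(payload, 4);
}
```

This is why qrtr_endpoint_post() can keep its `len & 3` sanity check: every transmitted packet is a whole number of 32-bit words.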
[PATCH v2 2/7] net: qrtr: Move constants to header file
The constants are used by both the name server and clients, so clarify
their value and move them to the uapi header.

Signed-off-by: Bjorn Andersson
---

Changes since v1:
- None

 include/uapi/linux/qrtr.h | 3 +++
 net/qrtr/qrtr.c           | 2 --
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/qrtr.h b/include/uapi/linux/qrtr.h
index 9d76c566f66e..63e8803e4d90 100644
--- a/include/uapi/linux/qrtr.h
+++ b/include/uapi/linux/qrtr.h
@@ -4,6 +4,9 @@
 #include <linux/socket.h>
 #include <linux/types.h>
 
+#define QRTR_NODE_BCAST	0xffffffffu
+#define QRTR_PORT_CTRL	0xfffffffeu
+
 struct sockaddr_qrtr {
 	__kernel_sa_family_t sq_family;
 	__u32 sq_node;
diff --git a/net/qrtr/qrtr.c b/net/qrtr/qrtr.c
index 7e4b49a8349e..15981abc042c 100644
--- a/net/qrtr/qrtr.c
+++ b/net/qrtr/qrtr.c
@@ -61,8 +61,6 @@ struct qrtr_hdr {
 } __packed;
 
 #define QRTR_HDR_SIZE sizeof(struct qrtr_hdr)
-#define QRTR_NODE_BCAST	((unsigned int)-1)
-#define QRTR_PORT_CTRL	((unsigned int)-2)
 
 struct qrtr_sock {
 	/* WARNING: sk must be the first member */
-- 
2.12.0
[PATCH v2 0/7] net: qrtr: Fixes and support receiving version 2 packets
On the latest Qualcomm platforms remote processors are sending packets
with version 2 of the message header. This series starts off with some
fixes and then refactors the qrtr code to support receiving messages of
both version 1 and version 2.

As all remotes are backwards compatible, transmitted packets continue
to be sent as version 1, but some groundwork has been done to make this
a per-link property.

Bjorn Andersson (7):
  net: qrtr: Invoke sk_error_report() after setting sk_err
  net: qrtr: Move constants to header file
  net: qrtr: Add control packet definition to uapi
  net: qrtr: Pass source and destination to enqueue functions
  net: qrtr: Clean up control packet handling
  net: qrtr: Use sk_buff->cb in receive path
  net: qrtr: Support decoding incoming v2 packets

 include/uapi/linux/qrtr.h |  35 +
 net/qrtr/qrtr.c           | 377 +-
 2 files changed, 241 insertions(+), 171 deletions(-)

-- 
2.12.0
[PATCH v2 3/7] net: qrtr: Add control packet definition to uapi
The QMUX protocol specification defines structure of the special control packet messages being sent between handlers of the control port. Add these to the uapi header, as this structure and the associated types are shared between the kernel and all userspace handlers of control messages. Signed-off-by: Bjorn Andersson --- Changes since v1: - None include/uapi/linux/qrtr.h | 32 net/qrtr/qrtr.c | 12 2 files changed, 32 insertions(+), 12 deletions(-) diff --git a/include/uapi/linux/qrtr.h b/include/uapi/linux/qrtr.h index 63e8803e4d90..179af64846e0 100644 --- a/include/uapi/linux/qrtr.h +++ b/include/uapi/linux/qrtr.h @@ -13,4 +13,36 @@ struct sockaddr_qrtr { __u32 sq_port; }; +enum qrtr_pkt_type { + QRTR_TYPE_DATA = 1, + QRTR_TYPE_HELLO = 2, + QRTR_TYPE_BYE = 3, + QRTR_TYPE_NEW_SERVER= 4, + QRTR_TYPE_DEL_SERVER= 5, + QRTR_TYPE_DEL_CLIENT= 6, + QRTR_TYPE_RESUME_TX = 7, + QRTR_TYPE_EXIT = 8, + QRTR_TYPE_PING = 9, + QRTR_TYPE_NEW_LOOKUP= 10, + QRTR_TYPE_DEL_LOOKUP= 11, +}; + +struct qrtr_ctrl_pkt { + __le32 cmd; + + union { + struct { + __le32 service; + __le32 instance; + __le32 node; + __le32 port; + } server; + + struct { + __le32 node; + __le32 port; + } client; + }; +} __packed; + #endif /* _LINUX_QRTR_H */ diff --git a/net/qrtr/qrtr.c b/net/qrtr/qrtr.c index 15981abc042c..d85ca7170b8f 100644 --- a/net/qrtr/qrtr.c +++ b/net/qrtr/qrtr.c @@ -26,18 +26,6 @@ #define QRTR_MIN_EPH_SOCKET 0x4000 #define QRTR_MAX_EPH_SOCKET 0x7fff -enum qrtr_pkt_type { - QRTR_TYPE_DATA = 1, - QRTR_TYPE_HELLO = 2, - QRTR_TYPE_BYE = 3, - QRTR_TYPE_NEW_SERVER= 4, - QRTR_TYPE_DEL_SERVER= 5, - QRTR_TYPE_DEL_CLIENT= 6, - QRTR_TYPE_RESUME_TX = 7, - QRTR_TYPE_EXIT = 8, - QRTR_TYPE_PING = 9, -}; - /** * struct qrtr_hdr - (I|R)PCrouter packet header * @version: protocol version -- 2.12.0
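A quick sanity check on the layout added above: the `cmd` field plus the larger (server) union member add up to 20 bytes, which matches the `pkt_len = 20` the qrtr core has been hard-coding for control packets. Below is a userspace mirror of the struct, assuming GCC/Clang `__attribute__((packed))` in place of the kernel's `__packed`; it is an illustrative copy, not the uapi header itself:

```c
#include <stdint.h>

/* Userspace mirror of struct qrtr_ctrl_pkt (illustrative copy). */
struct toy_qrtr_ctrl_pkt {
    uint32_t cmd;                 /* one of enum qrtr_pkt_type */
    union {
        struct {
            uint32_t service, instance, node, port;
        } server;                 /* NEW_SERVER / DEL_SERVER */
        struct {
            uint32_t node, port;
        } client;                 /* DEL_CLIENT / RESUME_TX */
    };
} __attribute__((packed));
```

Since every member is a 32-bit word, the packed attribute changes nothing here, but it documents that the layout is a wire format.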
[PATCH v2 5/7] net: qrtr: Clean up control packet handling
As the message header generation is deferred the internal functions for generating control packets can be simplified. This patch modifies qrtr_alloc_ctrl_packet() to, in addition to the sk_buff, return a reference to a struct qrtr_ctrl_pkt, which clarifies and simplifies the helpers to the point that these functions can be folded back into the callers. Signed-off-by: Bjorn Andersson --- Changes since v1: - None net/qrtr/qrtr.c | 93 ++--- 1 file changed, 29 insertions(+), 64 deletions(-) diff --git a/net/qrtr/qrtr.c b/net/qrtr/qrtr.c index 82dc83789310..a84edba7b1ef 100644 --- a/net/qrtr/qrtr.c +++ b/net/qrtr/qrtr.c @@ -255,9 +255,18 @@ int qrtr_endpoint_post(struct qrtr_endpoint *ep, const void *data, size_t len) } EXPORT_SYMBOL_GPL(qrtr_endpoint_post); -static struct sk_buff *qrtr_alloc_ctrl_packet(u32 type, size_t pkt_len, - u32 src_node, u32 dst_node) +/** + * qrtr_alloc_ctrl_packet() - allocate control packet skb + * @pkt: reference to qrtr_ctrl_pkt pointer + * + * Returns newly allocated sk_buff, or NULL on failure + * + * This function allocates a sk_buff large enough to carry a qrtr_ctrl_pkt and + * on success returns a reference to the control packet in @pkt. + */ +static struct sk_buff *qrtr_alloc_ctrl_packet(struct qrtr_ctrl_pkt **pkt) { + const int pkt_len = sizeof(struct qrtr_ctrl_pkt); struct sk_buff *skb; skb = alloc_skb(QRTR_HDR_SIZE + pkt_len, GFP_KERNEL); @@ -265,64 +274,7 @@ static struct sk_buff *qrtr_alloc_ctrl_packet(u32 type, size_t pkt_len, return NULL; skb_reserve(skb, QRTR_HDR_SIZE); - - return skb; -} - -/* Allocate and construct a resume-tx packet. 
*/ -static struct sk_buff *qrtr_alloc_resume_tx(u32 src_node, - u32 dst_node, u32 port) -{ - const int pkt_len = 20; - struct sk_buff *skb; - __le32 *buf; - - skb = qrtr_alloc_ctrl_packet(QRTR_TYPE_RESUME_TX, pkt_len, -src_node, dst_node); - if (!skb) - return NULL; - - buf = skb_put_zero(skb, pkt_len); - buf[0] = cpu_to_le32(QRTR_TYPE_RESUME_TX); - buf[1] = cpu_to_le32(src_node); - buf[2] = cpu_to_le32(port); - - return skb; -} - -/* Allocate and construct a BYE message to signal remote termination */ -static struct sk_buff *qrtr_alloc_local_bye(u32 src_node) -{ - const int pkt_len = 20; - struct sk_buff *skb; - __le32 *buf; - - skb = qrtr_alloc_ctrl_packet(QRTR_TYPE_BYE, pkt_len, -src_node, qrtr_local_nid); - if (!skb) - return NULL; - - buf = skb_put_zero(skb, pkt_len); - buf[0] = cpu_to_le32(QRTR_TYPE_BYE); - - return skb; -} - -static struct sk_buff *qrtr_alloc_del_client(struct sockaddr_qrtr *sq) -{ - const int pkt_len = 20; - struct sk_buff *skb; - __le32 *buf; - - skb = qrtr_alloc_ctrl_packet(QRTR_TYPE_DEL_CLIENT, pkt_len, -sq->sq_node, QRTR_NODE_BCAST); - if (!skb) - return NULL; - - buf = skb_put_zero(skb, pkt_len); - buf[0] = cpu_to_le32(QRTR_TYPE_DEL_CLIENT); - buf[1] = cpu_to_le32(sq->sq_node); - buf[2] = cpu_to_le32(sq->sq_port); + *pkt = skb_put_zero(skb, pkt_len); return skb; } @@ -337,6 +289,7 @@ static void qrtr_port_put(struct qrtr_sock *ipc); static void qrtr_node_rx_work(struct work_struct *work) { struct qrtr_node *node = container_of(work, struct qrtr_node, work); + struct qrtr_ctrl_pkt *pkt; struct sockaddr_qrtr dst; struct sockaddr_qrtr src; struct sk_buff *skb; @@ -372,10 +325,14 @@ static void qrtr_node_rx_work(struct work_struct *work) } if (confirm) { - skb = qrtr_alloc_resume_tx(dst_node, node->nid, dst_port); + skb = qrtr_alloc_ctrl_packet(&pkt); if (!skb) break; + pkt->cmd = cpu_to_le32(QRTR_TYPE_RESUME_TX); + pkt->client.node = cpu_to_le32(dst.sq_node); + pkt->client.port = cpu_to_le32(dst.sq_port); + if (qrtr_node_enqueue(node, 
skb, QRTR_TYPE_RESUME_TX, &dst, &src)) break; @@ -429,6 +386,7 @@ void qrtr_endpoint_unregister(struct qrtr_endpoint *ep) struct qrtr_node *node = ep->node; struct sockaddr_qrtr src = {AF_QIPCRTR, node->nid, QRTR_PORT_CTRL}; struct sockaddr_qrtr dst = {AF_QIPCRTR, qrtr_local_nid, QRTR_PORT_CTRL}; + struct qrtr_ctrl_pkt *pkt; struct sk_buff *skb; mutex_lock(&node->ep_lock); @@ -436,9 +394,11 @@ void qrtr_endpoint_unregister(struct qrtr_endpoint *ep) mutex_unl
[PATCH v2 7/7] net: qrtr: Support decoding incoming v2 packets
Add the necessary logic for decoding incoming messages of version 2 as well. Also make sure there's room for the bigger of version 1 and 2 headers in the code allocating skbs for outgoing messages. Signed-off-by: Bjorn Andersson --- Changes since v1: - Dropped __packed from struct qrtr_hdr_v2 net/qrtr/qrtr.c | 132 1 file changed, 94 insertions(+), 38 deletions(-) diff --git a/net/qrtr/qrtr.c b/net/qrtr/qrtr.c index 7bca6ec892a5..e458ece96d3d 100644 --- a/net/qrtr/qrtr.c +++ b/net/qrtr/qrtr.c @@ -20,14 +20,15 @@ #include "qrtr.h" -#define QRTR_PROTO_VER 1 +#define QRTR_PROTO_VER_1 1 +#define QRTR_PROTO_VER_2 3 /* auto-bind range */ #define QRTR_MIN_EPH_SOCKET 0x4000 #define QRTR_MAX_EPH_SOCKET 0x7fff /** - * struct qrtr_hdr - (I|R)PCrouter packet header + * struct qrtr_hdr_v1 - (I|R)PCrouter packet header version 1 * @version: protocol version * @type: packet type; one of QRTR_TYPE_* * @src_node_id: source node @@ -37,7 +38,7 @@ * @dst_node_id: destination node * @dst_port_id: destination port */ -struct qrtr_hdr { +struct qrtr_hdr_v1 { __le32 version; __le32 type; __le32 src_node_id; @@ -48,6 +49,32 @@ struct qrtr_hdr { __le32 dst_port_id; } __packed; +/** + * struct qrtr_hdr_v2 - (I|R)PCrouter packet header later versions + * @version: protocol version + * @type: packet type; one of QRTR_TYPE_* + * @flags: bitmask of QRTR_FLAGS_* + * @optlen: length of optional header data + * @size: length of packet, excluding this header and optlen + * @src_node_id: source node + * @src_port_id: source port + * @dst_node_id: destination node + * @dst_port_id: destination port + */ +struct qrtr_hdr_v2 { + u8 version; + u8 type; + u8 flags; + u8 optlen; + __le32 size; + __le16 src_node_id; + __le16 src_port_id; + __le16 dst_node_id; + __le16 dst_port_id; +}; + +#define QRTR_FLAGS_CONFIRM_RX BIT(0) + struct qrtr_cb { u32 src_node; u32 src_port; @@ -58,7 +85,8 @@ struct qrtr_cb { u8 confirm_rx; }; -#define QRTR_HDR_SIZE sizeof(struct qrtr_hdr) +#define QRTR_HDR_MAX_SIZE 
max_t(size_t, sizeof(struct qrtr_hdr_v1), \ + sizeof(struct qrtr_hdr_v2)) struct qrtr_sock { /* WARNING: sk must be the first member */ @@ -154,12 +182,12 @@ static int qrtr_node_enqueue(struct qrtr_node *node, struct sk_buff *skb, int type, struct sockaddr_qrtr *from, struct sockaddr_qrtr *to) { - struct qrtr_hdr *hdr; + struct qrtr_hdr_v1 *hdr; size_t len = skb->len; int rc = -ENODEV; - hdr = skb_push(skb, QRTR_HDR_SIZE); - hdr->version = cpu_to_le32(QRTR_PROTO_VER); + hdr = skb_push(skb, sizeof(*hdr)); + hdr->version = cpu_to_le32(QRTR_PROTO_VER_1); hdr->type = cpu_to_le32(type); hdr->src_node_id = cpu_to_le32(from->sq_node); hdr->src_port_id = cpu_to_le32(from->sq_port); @@ -224,52 +252,80 @@ static void qrtr_node_assign(struct qrtr_node *node, unsigned int nid) int qrtr_endpoint_post(struct qrtr_endpoint *ep, const void *data, size_t len) { struct qrtr_node *node = ep->node; - const struct qrtr_hdr *phdr = data; + const struct qrtr_hdr_v1 *v1; + const struct qrtr_hdr_v2 *v2; struct sk_buff *skb; struct qrtr_cb *cb; - unsigned int psize; unsigned int size; - unsigned int type; unsigned int ver; - unsigned int dst; + size_t hdrlen; - if (len < QRTR_HDR_SIZE || len & 3) + if (len & 3) return -EINVAL; - ver = le32_to_cpu(phdr->version); - size = le32_to_cpu(phdr->size); - type = le32_to_cpu(phdr->type); - dst = le32_to_cpu(phdr->dst_port_id); + skb = netdev_alloc_skb(NULL, len); + if (!skb) + return -ENOMEM; - psize = (size + 3) & ~3; + cb = (struct qrtr_cb *)skb->cb; - if (ver != QRTR_PROTO_VER) - return -EINVAL; + /* Version field in v1 is little endian, so this works for both cases */ + ver = *(u8*)data; - if (len != psize + QRTR_HDR_SIZE) - return -EINVAL; + switch (ver) { + case QRTR_PROTO_VER_1: + v1 = data; + hdrlen = sizeof(*v1); - if (dst != QRTR_PORT_CTRL && type != QRTR_TYPE_DATA) - return -EINVAL; + cb->type = le32_to_cpu(v1->type); + cb->src_node = le32_to_cpu(v1->src_node_id); + cb->src_port = le32_to_cpu(v1->src_port_id); + cb->confirm_rx = 
!!v1->confirm_rx; + cb->dst_node = le32_to_cpu(v1->dst_node_id); + cb->dst_port = le32_to_cpu(v1->dst_port_id); - skb = netdev_alloc_skb(NULL, len); - if (!skb) - return -ENOMEM; + size = le32_to_cpu(v1
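The new QRTR_HDR_MAX_SIZE in this patch reserves headroom for the larger of the two layouts on every outgoing skb. Userspace mirrors of both headers (the v2 struct deliberately left unpacked, as in the patch's changelog note) show that v1 dominates at 32 bytes versus 16 — these are illustrative copies, not the kernel structs:

```c
#include <stddef.h>
#include <stdint.h>

/* Mirror of struct qrtr_hdr_v1: eight __le32 fields. */
struct toy_hdr_v1 {
    uint32_t version, type;
    uint32_t src_node_id, src_port_id;
    uint32_t confirm_rx, size;
    uint32_t dst_node_id, dst_port_id;
};

/* Mirror of struct qrtr_hdr_v2; unpacked, as in the patch, since its
 * natural layout already has no padding holes. */
struct toy_hdr_v2 {
    uint8_t  version, type, flags, optlen;
    uint32_t size;
    uint16_t src_node_id, src_port_id;
    uint16_t dst_node_id, dst_port_id;
};
```

Note also why the single-byte version sniff in qrtr_endpoint_post() works for both layouts: v1 stores the version as a little-endian 32-bit word whose low byte comes first on the wire, and v2 stores it as a plain u8 in the same position.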
[PATCH v2 6/7] net: qrtr: Use sk_buff->cb in receive path
Rather than parsing the header of incoming messages throughout the implementation do it once when we retrieve the message and store the relevant information in the "cb" member of the sk_buff. This allows us to, in a later commit, decode version 2 messages into this same structure. Signed-off-by: Bjorn Andersson --- Changes since v1: - None net/qrtr/qrtr.c | 70 - 1 file changed, 40 insertions(+), 30 deletions(-) diff --git a/net/qrtr/qrtr.c b/net/qrtr/qrtr.c index a84edba7b1ef..7bca6ec892a5 100644 --- a/net/qrtr/qrtr.c +++ b/net/qrtr/qrtr.c @@ -48,6 +48,16 @@ struct qrtr_hdr { __le32 dst_port_id; } __packed; +struct qrtr_cb { + u32 src_node; + u32 src_port; + u32 dst_node; + u32 dst_port; + + u8 type; + u8 confirm_rx; +}; + #define QRTR_HDR_SIZE sizeof(struct qrtr_hdr) struct qrtr_sock { @@ -216,6 +226,7 @@ int qrtr_endpoint_post(struct qrtr_endpoint *ep, const void *data, size_t len) struct qrtr_node *node = ep->node; const struct qrtr_hdr *phdr = data; struct sk_buff *skb; + struct qrtr_cb *cb; unsigned int psize; unsigned int size; unsigned int type; @@ -245,8 +256,15 @@ int qrtr_endpoint_post(struct qrtr_endpoint *ep, const void *data, size_t len) if (!skb) return -ENOMEM; - skb_reset_transport_header(skb); - skb_put_data(skb, data, len); + cb = (struct qrtr_cb *)skb->cb; + cb->src_node = le32_to_cpu(phdr->src_node_id); + cb->src_port = le32_to_cpu(phdr->src_port_id); + cb->dst_node = le32_to_cpu(phdr->dst_node_id); + cb->dst_port = le32_to_cpu(phdr->dst_port_id); + cb->type = type; + cb->confirm_rx = !!phdr->confirm_rx; + + skb_put_data(skb, data + QRTR_HDR_SIZE, size); skb_queue_tail(&node->rx_queue, skb); schedule_work(&node->work); @@ -295,26 +313,20 @@ static void qrtr_node_rx_work(struct work_struct *work) struct sk_buff *skb; while ((skb = skb_dequeue(&node->rx_queue)) != NULL) { - const struct qrtr_hdr *phdr; - u32 dst_node, dst_port; struct qrtr_sock *ipc; - u32 src_node; + struct qrtr_cb *cb; int confirm; - phdr = (const struct qrtr_hdr 
*)skb_transport_header(skb); - src_node = le32_to_cpu(phdr->src_node_id); - dst_node = le32_to_cpu(phdr->dst_node_id); - dst_port = le32_to_cpu(phdr->dst_port_id); - confirm = !!phdr->confirm_rx; + cb = (struct qrtr_cb *)skb->cb; + src.sq_node = cb->src_node; + src.sq_port = cb->src_port; + dst.sq_node = cb->dst_node; + dst.sq_port = cb->dst_port; + confirm = !!cb->confirm_rx; - src.sq_node = src_node; - src.sq_port = le32_to_cpu(phdr->src_port_id); - dst.sq_node = dst_node; - dst.sq_port = dst_port; + qrtr_node_assign(node, cb->src_node); - qrtr_node_assign(node, src_node); - - ipc = qrtr_port_lookup(dst_port); + ipc = qrtr_port_lookup(cb->dst_port); if (!ipc) { kfree_skb(skb); } else { @@ -604,7 +616,7 @@ static int qrtr_local_enqueue(struct qrtr_node *node, struct sk_buff *skb, struct sockaddr_qrtr *to) { struct qrtr_sock *ipc; - struct qrtr_hdr *phdr; + struct qrtr_cb *cb; ipc = qrtr_port_lookup(to->sq_port); if (!ipc || &ipc->sk == skb->sk) { /* do not send to self */ @@ -612,11 +624,9 @@ static int qrtr_local_enqueue(struct qrtr_node *node, struct sk_buff *skb, return -ENODEV; } - phdr = skb_push(skb, QRTR_HDR_SIZE); - skb_reset_transport_header(skb); - - phdr->src_node_id = cpu_to_le32(from->sq_node); - phdr->src_port_id = cpu_to_le32(from->sq_port); + cb = (struct qrtr_cb *)skb->cb; + cb->src_node = from->sq_node; + cb->src_port = from->sq_port; if (sock_queue_rcv_skb(&ipc->sk, skb)) { qrtr_port_put(ipc); @@ -750,9 +760,9 @@ static int qrtr_recvmsg(struct socket *sock, struct msghdr *msg, size_t size, int flags) { DECLARE_SOCKADDR(struct sockaddr_qrtr *, addr, msg->msg_name); - const struct qrtr_hdr *phdr; struct sock *sk = sock->sk; struct sk_buff *skb; + struct qrtr_cb *cb; int copied, rc; lock_sock(sk); @@ -769,22 +779,22 @@ static int qrtr_recvmsg(struct socket *sock, struct msghdr *msg, return rc; } - phdr = (const struct qrtr_hdr *)skb_transport_header(skb); - copied = le32_to_cpu(phdr->size); + copied = skb->len; if (copied > size) { copied =
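One constraint this approach relies on: skb->cb is a fixed 48-byte scratch area, so the decoded-header struct must fit inside it. A mirror of the qrtr_cb introduced above, with a compile-time check (the 48 is the upstream size of sk_buff::cb, restated here as an assumption):

```c
#include <stdint.h>

#define TOY_SKB_CB_SIZE 48   /* assumed: sizeof sk_buff::cb upstream */

/* Mirror of the qrtr_cb struct introduced by this patch. */
struct toy_qrtr_cb {
    uint32_t src_node, src_port;
    uint32_t dst_node, dst_port;
    uint8_t  type, confirm_rx;
};

_Static_assert(sizeof(struct toy_qrtr_cb) <= TOY_SKB_CB_SIZE,
               "decoded header state must fit in skb->cb");
```

At roughly 20 bytes after padding, qrtr_cb leaves plenty of slack, so adding fields later (as the v2 patch effectively relies on) is safe.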
Re: [patch net-next 0/4] net: sched: get rid of cls_flower->egress_dev
Tue, Oct 10, 2017 at 11:46:22PM CEST, gerlitz...@gmail.com wrote:
>On Wed, Oct 11, 2017 at 12:13 AM, Jiri Pirko wrote:
>> Tue, Oct 10, 2017 at 07:24:21PM CEST, gerlitz...@gmail.com wrote:
>
>> Or, as I replied to you earlier, the issue you describe is totally
>> unrelated to this patchset as you see the issue with the current net-next.
>
>Jiri, the point I wanted to make is that if indeed there's a bug in mlx5
>or flower, we will have to fix it for 4.14, and then these bits would
>have to be rebased when net-next is re-planted over net. I put "FWIW"
>before that, so maybe it doesn't W so much, we'll see.

The fix is still unrelated to this patchset.
Re: [PATCH v8 01/20] crypto: change transient busy return code to -EAGAIN
On Sat, Oct 07, 2017 at 10:51:42AM +0300, Gilad Ben-Yossef wrote:
> On Sat, Oct 7, 2017 at 6:05 AM, Herbert Xu wrote:
> > On Tue, Sep 05, 2017 at 03:38:40PM +0300, Gilad Ben-Yossef wrote:
> >>
> >> diff --git a/crypto/algif_hash.c b/crypto/algif_hash.c
> >> index 5e92bd2..3b3c154 100644
> >> --- a/crypto/algif_hash.c
> >> +++ b/crypto/algif_hash.c
> >> @@ -39,6 +39,20 @@ struct algif_hash_tfm {
> >>  	bool has_key;
> >>  };
> >>
> >> +/* Previous versions of crypto_* ops used to return -EBUSY
> >> + * rather than -EAGAIN to indicate being tied up. The in
> >> + * kernel API changed but we don't want to break the user
> >> + * space API. As only the hash user interface exposed this
> >> + * error ever to the user, do the translation here.
> >> + */
> >> +static inline int crypto_user_err(int err)
> >> +{
> >> +	if (err == -EAGAIN)
> >> +		return -EBUSY;
> >> +
> >> +	return err;
> >
> > I don't see the need to carry along this baggage. Does anyone
> > in user-space actually rely on EBUSY?
>
> I am not aware of anyone who does. I was just trying to avoid
> changing the user ABI.
>
> Shall I roll a new revision without this patch?

Yes please. I'd rather not carry this around for eternity unless it
was actually required.

Thanks,
-- 
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [net-next V6 PATCH 0/5] New bpf cpumap type for XDP_REDIRECT
On 10/10/2017 05:47 AM, Jesper Dangaard Brouer wrote:
> Introducing a new way to redirect XDP frames. Notice how no driver
> changes are necessary given the design of XDP_REDIRECT.
>
> This redirect map type is called 'cpumap', as it allows redirecting
> XDP frames to remote CPUs. The remote CPU will do the SKB allocation
> and start the network stack invocation on that CPU.
>
> This is a scalability and isolation mechanism that allows separating
> the early driver network XDP layer from the rest of the netstack, and
> assigning dedicated CPUs for this stage. The sysadmin controls and
> configures the RX-CPU to NIC-RX queue mapping (as usual) via procfs
> smp_affinity, and how many queues are configured via ethtool
> --set-channels. Benchmarks show that a single CPU can handle approx
> 11Mpps. Thus, only assigning two NIC RX-queues (and two CPUs) is
> sufficient for handling 10Gbit/s wirespeed smallest packet 14.88Mpps.
> Reducing the number of queues has the advantage that more packets are
> "bulk" available per hard interrupt[1].
>
> [1] https://www.netdevconf.org/2.1/papers/BusyPollingNextGen.pdf
>
> Use-cases:
>
> 1. End-host based pre-filtering for DDoS mitigation. This is fast
>    enough to allow software to see and filter all packets wirespeed.
>    Thus, no packets getting silently dropped by hardware.
>
> 2. Given NIC HW unevenly distributes packets across RX queue, this
>    mechanism can be used for redistribution load across CPUs. This
>    usually happens when HW is unaware of a new protocol. This
>    resembles RPS (Receive Packet Steering), just faster, but with more
>    responsibility placed on the BPF program for correct steering.

Hi Jesper,

Another (somewhat meta) comment about the performance benchmarks. In
one of the original threads you showed that the XDP cpu map
outperformed RPS in TCP_CRR netperf tests. It was significant iirc, in
the Mpps range. But, with this series we will skip GRO. Do you have
any idea how this looks with other tests such as TCP_STREAM?

I'm trying to understand if this is something that can be used in the
general case, or is more for the special case and will have to be
enabled/disabled by the orchestration layer depending on
workload/network conditions. My intuition is the general case will be
slower due to lack of GRO.

If this is the case, any ideas how we could add GRO? Not needed in the
initial patchset, but trying to see if the two are mutually exclusive.
I don't off-hand see an easy way to pull GRO into this feature.

Thanks,
John
[PATCH net] net/ncsi: Don't limit vids based on hot_channel
Currently we drop any new VLAN ids if there are more than the current (or last used) channel can support. Most importantly this is a problem if no channel has been selected yet, resulting in a segfault. Secondly this does not necessarily reflect the capabilities of any other channels. Instead only drop a new VLAN id if we are already tracking the maximum allowed by the NCSI specification. Per-channel limits are already handled by ncsi_add_filter(), but add a message to set_one_vid() to make it obvious that the channel can not support any more VLAN ids. Signed-off-by: Samuel Mendoza-Jonas --- net/ncsi/internal.h| 1 + net/ncsi/ncsi-manage.c | 17 + 2 files changed, 10 insertions(+), 8 deletions(-) diff --git a/net/ncsi/internal.h b/net/ncsi/internal.h index af3d636534ef..d30f7bd741d0 100644 --- a/net/ncsi/internal.h +++ b/net/ncsi/internal.h @@ -286,6 +286,7 @@ struct ncsi_dev_priv { struct work_struct work;/* For channel management */ struct packet_type ptype; /* NCSI packet Rx handler */ struct list_headnode;/* Form NCSI device list */ +#define NCSI_MAX_VLAN_VIDS 15 struct list_headvlan_vids; /* List of active VLAN IDs */ }; diff --git a/net/ncsi/ncsi-manage.c b/net/ncsi/ncsi-manage.c index 3fd3c39e6278..b6a449aa9d4b 100644 --- a/net/ncsi/ncsi-manage.c +++ b/net/ncsi/ncsi-manage.c @@ -732,6 +732,10 @@ static int set_one_vid(struct ncsi_dev_priv *ndp, struct ncsi_channel *nc, if (index < 0) { netdev_err(ndp->ndev.dev, "Failed to add new VLAN tag, error %d\n", index); + if (index == -ENOSPC) + netdev_err(ndp->ndev.dev, + "Channel %u already has all VLAN filters set\n", + nc->id); return -1; } @@ -1403,7 +1407,6 @@ static int ncsi_kick_channels(struct ncsi_dev_priv *ndp) int ncsi_vlan_rx_add_vid(struct net_device *dev, __be16 proto, u16 vid) { - struct ncsi_channel_filter *ncf; struct ncsi_dev_priv *ndp; unsigned int n_vids = 0; struct vlan_vid *vlan; @@ -1420,7 +1423,6 @@ int ncsi_vlan_rx_add_vid(struct net_device *dev, __be16 proto, u16 vid) } ndp = 
TO_NCSI_DEV_PRIV(nd); - ncf = ndp->hot_channel->filters[NCSI_FILTER_VLAN]; /* Add the VLAN id to our internal list */ list_for_each_entry_rcu(vlan, &ndp->vlan_vids, list) { @@ -1431,12 +1433,11 @@ int ncsi_vlan_rx_add_vid(struct net_device *dev, __be16 proto, u16 vid) return 0; } } - - if (n_vids >= ncf->total) { - netdev_info(dev, - "NCSI Channel supports up to %u VLAN tags but %u are already set\n", - ncf->total, n_vids); - return -EINVAL; + if (n_vids >= NCSI_MAX_VLAN_VIDS) { + netdev_warn(dev, + "tried to add vlan id %u but NCSI max already registered (%u)\n", + vid, NCSI_MAX_VLAN_VIDS); + return -ENOSPC; } vlan = kzalloc(sizeof(*vlan), GFP_KERNEL); -- 2.14.2
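A small model of the behaviour this patch introduces: duplicates are accepted silently, and the 16th distinct VLAN id is rejected with -ENOSPC once NCSI_MAX_VLAN_VIDS (15, from the NCSI specification's VLAN filter table) ids are already tracked. This is toy code sketching the logic, not the ncsi-manage.c implementation:

```c
#include <errno.h>

#define TOY_NCSI_MAX_VLAN_VIDS 15

/* Add a VLAN id to a flat toy list: ids already present succeed
 * silently, and the list is capped at the NCSI-spec maximum. */
int toy_add_vid(unsigned int *vids, unsigned int *n_vids, unsigned int vid)
{
    unsigned int i;

    for (i = 0; i < *n_vids; i++)
        if (vids[i] == vid)
            return 0;                     /* already tracked */

    if (*n_vids >= TOY_NCSI_MAX_VLAN_VIDS)
        return -ENOSPC;                   /* mirrors the new error */

    vids[(*n_vids)++] = vid;
    return 0;
}
```

Keeping the cap global (the spec maximum) rather than per-channel is the point of the patch: it no longer depends on a hot_channel having been selected.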
Re: High CPU load by native_queued_spin_lock_slowpath
I'm using the ifb0 device for outgoing traffic. I have one bond0 interface with an exit to the Internet, and two interfaces, eth0 and eth2, to local users. ifb0 is used for shaping Internet traffic from bond0 to eth2 or eth0; all outgoing traffic to eth0 and eth2 is redirected to ifb0. > What about multiple ifb instead, one per RX queue ? Are you suggesting redirecting traffic from every queue to its own ifb device? I do not quite understand. 2017-10-10 20:07 GMT+06:00 Eric Dumazet : > On Tue, 2017-10-10 at 18:00 +0600, Sergey K. wrote: >> I'm using Debian 9(stretch edition) kernel 4.9., hp dl385 g7 server >> with 32 cpu cores. NIC queues are tied to processor cores. Server is >> shaping traffic (iproute2 and htb discipline + skbinfo + ipset + ifb) >> and filtering some rules by iptables. >> >> At that moment, when traffic goes up about 1gbit/s cpu is very high >> loaded. Perf tool tells me that kernel module >> native_queued_spin_lock_slowpath loading cpu about 40%. >> >> After several hours of searching, I found that if I remove the htb >> discipline from ifb0, the high load goes down. >> Well, I think that problem with classify and shaping by htb. >> >> Who knows how to solve? > > You use a single ifb0 on the whole (multiqueue) device for ingress ? > > What about multiple ifb instead, one per RX queue ? > > Alternative is to reduce contention and use a single RX queue. > >
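For reference, the multi-ifb layout Eric suggests looks roughly like this. This is a sketch only: device names, rates, queue count and the split criterion are illustrative (here a u32 byte match on the source address stands in for a true per-RX-queue split), but the point is that each ifb gets its own htb tree, so the qdisc root lock is no longer shared by every CPU:

```shell
# Several ifb devices instead of a single ifb0.
modprobe ifb numifbs=4
for i in 0 1 2 3; do
    ip link set dev ifb$i up
    tc qdisc add dev ifb$i root handle 1: htb default 10
    tc class add dev ifb$i parent 1: classid 1:10 htb rate 250mbit
done

tc qdisc add dev bond0 handle ffff: ingress
# One redirect filter per ifb; "match u8 $i 0x03 at 15" keys on the low
# two bits of the IPv4 source address as an illustrative traffic split.
for i in 0 1 2 3; do
    tc filter add dev bond0 parent ffff: protocol ip prio 1 \
        u32 match u8 $i 0x03 at 15 \
        action mirred egress redirect dev ifb$i
done
```

With contention split four ways, native_queued_spin_lock_slowpath should stop dominating the profile at the same traffic level.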
Re: [net-next V6 PATCH 1/5] bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAP
On Wed, 11 Oct 2017 00:48:40 +0200 Daniel Borkmann wrote: > On 10/10/2017 02:47 PM, Jesper Dangaard Brouer wrote: > [...] > > +static struct bpf_map *cpu_map_alloc(union bpf_attr *attr) > > +{ > > + struct bpf_cpu_map *cmap; > > + int err = -ENOMEM; > > + u64 cost; > > + int ret; > > + > > + if (!capable(CAP_SYS_ADMIN)) > > + return ERR_PTR(-EPERM); > > + > > + /* check sanity of attributes */ > > + if (attr->max_entries == 0 || attr->key_size != 4 || > > + attr->value_size != 4 || attr->map_flags & ~BPF_F_NUMA_NODE) > > + return ERR_PTR(-EINVAL); > > + > > + cmap = kzalloc(sizeof(*cmap), GFP_USER); > > + if (!cmap) > > + return ERR_PTR(-ENOMEM); > > + > > + /* mandatory map attributes */ > > + cmap->map.map_type = attr->map_type; > > + cmap->map.key_size = attr->key_size; > > + cmap->map.value_size = attr->value_size; > > + cmap->map.max_entries = attr->max_entries; > > + cmap->map.map_flags = attr->map_flags; > > + cmap->map.numa_node = bpf_map_attr_numa_node(attr); > > + > > + /* Pre-limit array size based on NR_CPUS, not final CPU check */ > > + if (cmap->map.max_entries > NR_CPUS) > > + return ERR_PTR(-E2BIG); > > We still have a leak here, meaning kfree(cmap) is missing on above error. Darn... yes, I introduced this in this V6 as I moved the check. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
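The leak Daniel points out is the classic allocate-check-return pattern: once `cmap` is allocated, every later error path must free it. A minimal userspace analogue of the corrected flow (hypothetical names, `calloc`/`free` standing in for `kzalloc`/`kfree`, `errno` standing in for `ERR_PTR`):

```c
#include <stdlib.h>
#include <errno.h>

#define NR_CPUS_STUB 64		/* stand-in for the kernel's NR_CPUS */

struct cpu_map_stub {
	unsigned int max_entries;
	void **cpu_slots;
};

/* Returns NULL and sets errno on failure. Every error path after the
 * first allocation frees the partially built map - the missing kfree()
 * in the patch under review is exactly the E2BIG branch below. */
static struct cpu_map_stub *cpu_map_alloc_stub(unsigned int max_entries)
{
	struct cpu_map_stub *cmap;

	if (max_entries == 0) {
		errno = EINVAL;
		return NULL;
	}

	cmap = calloc(1, sizeof(*cmap));
	if (!cmap) {
		errno = ENOMEM;
		return NULL;
	}
	cmap->max_entries = max_entries;

	/* Pre-limit array size based on NR_CPUS, not final CPU check */
	if (max_entries > NR_CPUS_STUB) {
		free(cmap);		/* <-- the fix: no leak on E2BIG */
		errno = E2BIG;
		return NULL;
	}

	cmap->cpu_slots = calloc(max_entries, sizeof(void *));
	if (!cmap->cpu_slots) {
		free(cmap);
		errno = ENOMEM;
		return NULL;
	}
	return cmap;
}
```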
[PATCH] r8169: only enable PCI wakeups when WOL is active
rtl_init_one() currently enables PCI wakeups if the ethernet device is found to be WOL-capable. There is no need to do this when rtl8169_set_wol() will correctly enable or disable the same wakeup flag when WOL is activated/deactivated. This works around an ACPI DSDT bug which prevents the Acer laptop models Aspire ES1-533, Aspire ES1-732, PackardBell ENTE69AP and Gateway NE533 from entering S3 suspend - even when no ethernet cable is connected. On these platforms, the DSDT says that GPE08 is a wakeup source for ethernet, but this GPE fires as soon as the system goes into suspend, waking the system up immediately. Having the wakeup normally disabled avoids this issue in the default case. With this change, WOL will continue to be unusable on these platforms (it will instantly wake up if WOL is later enabled by the user) but we do not expect this to be a commonly used feature on these consumer laptops. We have separately determined that WOL works fine without any ACPI GPEs enabled during sleep, so a DSDT fix or override would be possible to make WOL work. Signed-off-by: Daniel Drake --- drivers/net/ethernet/realtek/r8169.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c index e03fcf914690..a3c949ea7d1a 100644 --- a/drivers/net/ethernet/realtek/r8169.c +++ b/drivers/net/ethernet/realtek/r8169.c @@ -8491,8 +8491,6 @@ static int rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent) rtl8168_driver_start(tp); } - device_set_wakeup_enable(&pdev->dev, tp->features & RTL_FEATURE_WOL); - if (pci_dev_run_wake(pdev)) pm_runtime_put_noidle(&pdev->dev); -- 2.11.0
[PATCH net-next 6/7] net: qualcomm: rmnet: Convert the muxed endpoint to hlist
Rather than using a static array, use a hlist to store the muxed endpoints and use the mux id to query the rmnet_device. This is useful as usually very few mux ids are used. Signed-off-by: Subash Abhinov Kasiviswanathan Cc: Dan Williams --- drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c | 75 -- drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h | 4 +- .../net/ethernet/qualcomm/rmnet/rmnet_handlers.c | 17 +++-- .../ethernet/qualcomm/rmnet/rmnet_map_command.c| 4 +- drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c| 15 +++-- drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.h| 6 +- 6 files changed, 68 insertions(+), 53 deletions(-) diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c index 96058bb..b5fe3f4 100644 --- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c +++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c @@ -61,18 +61,6 @@ static int rmnet_is_real_dev_registered(const struct net_device *real_dev) return rtnl_dereference(real_dev->rx_handler_data); } -static struct rmnet_endpoint* -rmnet_get_endpoint(struct net_device *dev, int config_id) -{ - struct rmnet_endpoint *ep; - struct rmnet_port *port; - - port = rmnet_get_port_rtnl(dev); - ep = &port->muxed_ep[config_id]; - - return ep; -} - static int rmnet_unregister_real_device(struct net_device *real_dev, struct rmnet_port *port) { @@ -93,7 +81,7 @@ static int rmnet_unregister_real_device(struct net_device *real_dev, static int rmnet_register_real_device(struct net_device *real_dev) { struct rmnet_port *port; - int rc; + int rc, entry; ASSERT_RTNL(); @@ -114,26 +102,13 @@ static int rmnet_register_real_device(struct net_device *real_dev) /* hold on to real dev for MAP data */ dev_hold(real_dev); + for (entry = 0; entry < RMNET_MAX_LOGICAL_EP; entry++) + INIT_HLIST_HEAD(&port->muxed_ep[entry]); + netdev_dbg(real_dev, "registered with rmnet\n"); return 0; } -static void rmnet_set_endpoint_config(struct net_device *dev, - u8 mux_id, struct 
net_device *egress_dev) -{ - struct rmnet_endpoint *ep; - - netdev_dbg(dev, "id %d dev %s\n", mux_id, egress_dev->name); - - ep = rmnet_get_endpoint(dev, mux_id); - /* This config is cleared on every set, so its ok to not -* clear it on a device delete. -*/ - memset(ep, 0, sizeof(struct rmnet_endpoint)); - ep->egress_dev = egress_dev; - ep->mux_id = mux_id; -} - static int rmnet_newlink(struct net *src_net, struct net_device *dev, struct nlattr *tb[], struct nlattr *data[], struct netlink_ext_ack *extack) @@ -145,6 +120,7 @@ static int rmnet_newlink(struct net *src_net, struct net_device *dev, RMNET_EGRESS_FORMAT_MAP; struct net_device *real_dev; int mode = RMNET_EPMODE_VND; + struct rmnet_endpoint *ep; struct rmnet_port *port; int err = 0; u16 mux_id; @@ -156,6 +132,10 @@ static int rmnet_newlink(struct net *src_net, struct net_device *dev, if (!data[IFLA_VLAN_ID]) return -EINVAL; + ep = kzalloc(sizeof(*ep), GFP_ATOMIC); + if (!ep) + return -ENOMEM; + mux_id = nla_get_u16(data[IFLA_VLAN_ID]); err = rmnet_register_real_device(real_dev); @@ -163,7 +143,7 @@ static int rmnet_newlink(struct net *src_net, struct net_device *dev, goto err0; port = rmnet_get_port_rtnl(real_dev); - err = rmnet_vnd_newlink(mux_id, dev, port, real_dev); + err = rmnet_vnd_newlink(mux_id, dev, port, real_dev, ep); if (err) goto err1; @@ -177,11 +157,11 @@ static int rmnet_newlink(struct net *src_net, struct net_device *dev, port->ingress_data_format = ingress_format; port->rmnet_mode = mode; - rmnet_set_endpoint_config(real_dev, mux_id, dev); + hlist_add_head_rcu(&ep->hlnode, &port->muxed_ep[mux_id]); return 0; err2: - rmnet_vnd_dellink(mux_id, port); + rmnet_vnd_dellink(mux_id, port, ep); err1: rmnet_unregister_real_device(real_dev, port); err0: @@ -191,6 +171,7 @@ static int rmnet_newlink(struct net *src_net, struct net_device *dev, static void rmnet_dellink(struct net_device *dev, struct list_head *head) { struct net_device *real_dev; + struct rmnet_endpoint *ep; struct rmnet_port *port; 
u8 mux_id; @@ -204,8 +185,15 @@ static void rmnet_dellink(struct net_device *dev, struct list_head *head) port = rmnet_get_port_rtnl(real_dev); mux_id = rmnet_vnd_get_mux(dev); - rmnet_vnd_dellink(mux_id, port); netdev_upper_dev_unlink(dev, real_dev); + + ep = rmnet_get_endpoint(por
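The shape of this conversion is an array of list heads indexed by mux id, with each endpoint allocated on demand and linked into its bucket. A userspace sketch of the pattern (plain pointers approximating the kernel's hlist/RCU API; names are stand-ins):

```c
#include <stdlib.h>

#define RMNET_MAX_LOGICAL_EP 255

struct ep_stub {
	unsigned char mux_id;
	const char *egress_name;	/* stands in for ep->egress_dev */
	struct ep_stub *next;		/* stands in for the hlist node */
};

struct port_stub {
	struct ep_stub *muxed_ep[RMNET_MAX_LOGICAL_EP];
};

/* Allocate an endpoint and link it at the head of its bucket,
 * mirroring kzalloc() + hlist_add_head_rcu() in rmnet_newlink(). */
static int port_add_ep(struct port_stub *port, unsigned char mux_id,
		       const char *egress_name)
{
	struct ep_stub *ep = calloc(1, sizeof(*ep));

	if (!ep)
		return -1;
	ep->mux_id = mux_id;
	ep->egress_name = egress_name;
	ep->next = port->muxed_ep[mux_id];
	port->muxed_ep[mux_id] = ep;
	return 0;
}

/* Bucket walk, mirroring the hlist_for_each_entry_rcu() lookup in
 * rmnet_get_endpoint(). */
static struct ep_stub *port_get_ep(struct port_stub *port, unsigned char mux_id)
{
	struct ep_stub *ep;

	for (ep = port->muxed_ep[mux_id]; ep; ep = ep->next)
		if (ep->mux_id == mux_id)
			return ep;
	return NULL;
}
```

Since only a handful of mux ids are typically in use, memory is spent per active endpoint rather than on a fully populated static array.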
[PATCH net-next 5/7] net: qualcomm: rmnet: Remove duplicate setting of rmnet_devices
The rmnet_devices information is already stored in muxed_ep, so storing this in rmnet_devices[] again is redundant. Signed-off-by: Subash Abhinov Kasiviswanathan --- drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h | 1 - drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c| 8 2 files changed, 4 insertions(+), 5 deletions(-) diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h index c5f5c6d..123ccf4 100644 --- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h +++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h @@ -33,7 +33,6 @@ struct rmnet_port { struct rmnet_endpoint muxed_ep[RMNET_MAX_LOGICAL_EP]; u32 ingress_data_format; u32 egress_data_format; - struct net_device *rmnet_devices[RMNET_MAX_LOGICAL_EP]; u8 nr_rmnet_devs; u8 rmnet_mode; }; diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c b/drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c index 4ca59a4..8b8497b 100644 --- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c +++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c @@ -105,12 +105,12 @@ int rmnet_vnd_newlink(u8 id, struct net_device *rmnet_dev, struct rmnet_priv *priv; int rc; - if (port->rmnet_devices[id]) + if (port->muxed_ep[id].egress_dev) return -EINVAL; rc = register_netdevice(rmnet_dev); if (!rc) { - port->rmnet_devices[id] = rmnet_dev; + port->muxed_ep[id].egress_dev = rmnet_dev; port->nr_rmnet_devs++; rmnet_dev->rtnl_link_ops = &rmnet_link_ops; @@ -127,10 +127,10 @@ int rmnet_vnd_newlink(u8 id, struct net_device *rmnet_dev, int rmnet_vnd_dellink(u8 id, struct rmnet_port *port) { - if (id >= RMNET_MAX_LOGICAL_EP || !port->rmnet_devices[id]) + if (id >= RMNET_MAX_LOGICAL_EP || !port->muxed_ep[id].egress_dev) return -EINVAL; - port->rmnet_devices[id] = NULL; + port->muxed_ep[id].egress_dev = NULL; port->nr_rmnet_devs--; return 0; } -- 1.9.1
[PATCH net-next 2/7] net: qualcomm: rmnet: Remove some unused defines
Most of these constants were used in the initial patchset where custom netlink configuration was used and hence are no longer relevant. Signed-off-by: Subash Abhinov Kasiviswanathan --- drivers/net/ethernet/qualcomm/rmnet/rmnet_private.h | 8 1 file changed, 8 deletions(-) diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_private.h b/drivers/net/ethernet/qualcomm/rmnet/rmnet_private.h index 7967198..49102f9 100644 --- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_private.h +++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_private.h @@ -19,23 +19,15 @@ #define RMNET_TX_QUEUE_LEN 1000 /* Constants */ -#define RMNET_EGRESS_FORMAT__RESERVED__ BIT(0) #define RMNET_EGRESS_FORMAT_MAP BIT(1) #define RMNET_EGRESS_FORMAT_AGGREGATION BIT(2) #define RMNET_EGRESS_FORMAT_MUXING BIT(3) -#define RMNET_EGRESS_FORMAT_MAP_CKSUMV3 BIT(4) -#define RMNET_EGRESS_FORMAT_MAP_CKSUMV4 BIT(5) -#define RMNET_INGRESS_FIX_ETHERNET BIT(0) #define RMNET_INGRESS_FORMAT_MAPBIT(1) #define RMNET_INGRESS_FORMAT_DEAGGREGATION BIT(2) #define RMNET_INGRESS_FORMAT_DEMUXING BIT(3) #define RMNET_INGRESS_FORMAT_MAP_COMMANDS BIT(4) -#define RMNET_INGRESS_FORMAT_MAP_CKSUMV3BIT(5) -#define RMNET_INGRESS_FORMAT_MAP_CKSUMV4BIT(6) -/* Pass the frame up the stack with no modifications to skb->dev */ -#define RMNET_EPMODE_NONE (0) /* Replace skb->dev to a virtual rmnet device and pass up the stack */ #define RMNET_EPMODE_VND (1) /* Pass the frame directly to another device with dev_queue_xmit() */ -- 1.9.1
[PATCH net-next 4/7] net: qualcomm: rmnet: Remove duplicate setting of rmnet private info
The end point is set twice in the local_ep as well as the mux_id and the real_dev in the rmnet private structure. Remove the local_ep. While these elements are equivalent, rmnet_endpoint will be used only as part of the rmnet_port for muxed scenarios in VND mode. Signed-off-by: Subash Abhinov Kasiviswanathan --- drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c | 10 ++ drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h | 4 drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c | 18 ++ drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.h | 3 +-- drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c | 19 ++- drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.h | 1 - 6 files changed, 15 insertions(+), 40 deletions(-) diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c index 85fce9c..96058bb 100644 --- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c +++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c @@ -67,13 +67,8 @@ static int rmnet_is_real_dev_registered(const struct net_device *real_dev) struct rmnet_endpoint *ep; struct rmnet_port *port; - if (!rmnet_is_real_dev_registered(dev)) { - ep = rmnet_vnd_get_endpoint(dev); - } else { - port = rmnet_get_port_rtnl(dev); - - ep = &port->muxed_ep[config_id]; - } + port = rmnet_get_port_rtnl(dev); + ep = &port->muxed_ep[config_id]; return ep; } @@ -183,7 +178,6 @@ static int rmnet_newlink(struct net *src_net, struct net_device *dev, port->rmnet_mode = mode; rmnet_set_endpoint_config(real_dev, mux_id, dev); - rmnet_set_endpoint_config(dev, mux_id, real_dev); return 0; err2: diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h index 03d473f..c5f5c6d 100644 --- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h +++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h @@ -20,9 +20,6 @@ #define RMNET_MAX_LOGICAL_EP 255 -/* Information about the next device to deliver the packet to. 
- * Exact usage of this parameter depends on the rmnet_mode. - */ struct rmnet_endpoint { u8 mux_id; struct net_device *egress_dev; @@ -44,7 +41,6 @@ struct rmnet_port { extern struct rtnl_link_ops rmnet_link_ops; struct rmnet_priv { - struct rmnet_endpoint local_ep; u8 mux_id; struct net_device *real_dev; }; diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c b/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c index 86e37cc..e0802d3 100644 --- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c +++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c @@ -116,8 +116,7 @@ static void rmnet_set_skb_proto(struct sk_buff *skb) } static int rmnet_map_egress_handler(struct sk_buff *skb, - struct rmnet_port *port, - struct rmnet_endpoint *ep, + struct rmnet_port *port, u8 mux_id, struct net_device *orig_dev) { int required_headroom, additional_header_len; @@ -136,10 +135,10 @@ static int rmnet_map_egress_handler(struct sk_buff *skb, return RMNET_MAP_CONSUMED; if (port->egress_data_format & RMNET_EGRESS_FORMAT_MUXING) { - if (ep->mux_id == 0xff) + if (mux_id == 0xff) map_header->mux_id = 0; else - map_header->mux_id = ep->mux_id; + map_header->mux_id = mux_id; } skb->protocol = htons(ETH_P_MAP); @@ -176,14 +175,17 @@ rx_handler_result_t rmnet_rx_handler(struct sk_buff **pskb) * for egress device configured in logical endpoint. Packet is then transmitted * on the egress device. 
*/ -void rmnet_egress_handler(struct sk_buff *skb, - struct rmnet_endpoint *ep) +void rmnet_egress_handler(struct sk_buff *skb) { struct net_device *orig_dev; struct rmnet_port *port; + struct rmnet_priv *priv; + u8 mux_id; orig_dev = skb->dev; - skb->dev = ep->egress_dev; + priv = netdev_priv(orig_dev); + skb->dev = priv->real_dev; + mux_id = priv->mux_id; port = rmnet_get_port(skb->dev); if (!port) { @@ -192,7 +194,7 @@ void rmnet_egress_handler(struct sk_buff *skb, } if (port->egress_data_format & RMNET_EGRESS_FORMAT_MAP) { - switch (rmnet_map_egress_handler(skb, port, ep, orig_dev)) { + switch (rmnet_map_egress_handler(skb, port, mux_id, orig_dev)) { case RMNET_MAP_CONSUMED: return; diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.h b/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.h index f2638cf..3537e
[PATCH net-next 7/7] net: qualcomm: rmnet: Implement bridge mode
Add support to bridge two devices which can send multiplexing and aggregation (MAP) data. This is done only when the data itself is not going to be consumed in the stack but is being passed on to a different endpoint. This is mainly used for testing. Signed-off-by: Subash Abhinov Kasiviswanathan --- drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c | 93 +- drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h | 7 +- .../net/ethernet/qualcomm/rmnet/rmnet_handlers.c | 18 + drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c| 2 + 4 files changed, 118 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c index b5fe3f4..71bee1a 100644 --- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c +++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c @@ -109,6 +109,36 @@ static int rmnet_register_real_device(struct net_device *real_dev) return 0; } +static void rmnet_unregister_bridge(struct net_device *dev, + struct rmnet_port *port) +{ + struct net_device *rmnet_dev, *bridge_dev; + struct rmnet_port *bridge_port; + + if (port->rmnet_mode != RMNET_EPMODE_BRIDGE) + return; + + /* bridge slave handling */ + if (!port->nr_rmnet_devs) { + rmnet_dev = netdev_master_upper_dev_get_rcu(dev); + netdev_upper_dev_unlink(dev, rmnet_dev); + + bridge_dev = port->bridge_ep; + + bridge_port = rmnet_get_port_rtnl(bridge_dev); + bridge_port->bridge_ep = NULL; + bridge_port->rmnet_mode = RMNET_EPMODE_VND; + } else { + bridge_dev = port->bridge_ep; + + bridge_port = rmnet_get_port_rtnl(bridge_dev); + rmnet_dev = netdev_master_upper_dev_get_rcu(bridge_dev); + netdev_upper_dev_unlink(bridge_dev, rmnet_dev); + + rmnet_unregister_real_device(bridge_dev, bridge_port); + } +} + static int rmnet_newlink(struct net *src_net, struct net_device *dev, struct nlattr *tb[], struct nlattr *data[], struct netlink_ext_ack *extack) @@ -190,10 +220,10 @@ static void rmnet_dellink(struct net_device *dev, struct list_head 
*head) ep = rmnet_get_endpoint(port, mux_id); if (ep) { hlist_del_init_rcu(&ep->hlnode); + rmnet_unregister_bridge(dev, port); rmnet_vnd_dellink(mux_id, port, ep); kfree(ep); } - rmnet_unregister_real_device(real_dev, port); unregister_netdevice_queue(dev, head); @@ -237,6 +267,8 @@ static void rmnet_force_unassociate_device(struct net_device *dev) d.port = port; rcu_read_lock(); + rmnet_unregister_bridge(dev, port); + netdev_walk_all_lower_dev_rcu(real_dev, rmnet_dev_walk_unreg, &d); rcu_read_unlock(); unregister_netdevice_many(&list); @@ -321,6 +353,65 @@ struct rmnet_endpoint *rmnet_get_endpoint(struct rmnet_port *port, u8 mux_id) return NULL; } +int rmnet_add_bridge(struct net_device *rmnet_dev, +struct net_device *slave_dev, +struct netlink_ext_ack *extack) +{ + struct rmnet_priv *priv = netdev_priv(rmnet_dev); + struct net_device *real_dev = priv->real_dev; + struct rmnet_port *port, *slave_port; + int err; + + port = rmnet_get_port(real_dev); + + /* If there is more than one rmnet dev attached, its probably being +* used for muxing. 
Skip the bridging in that case +*/ + if (port->nr_rmnet_devs > 1) + return -EINVAL; + + if (rmnet_is_real_dev_registered(slave_dev)) + return -EBUSY; + + err = rmnet_register_real_device(slave_dev); + if (err) + return -EBUSY; + + err = netdev_master_upper_dev_link(slave_dev, rmnet_dev, NULL, NULL, + extack); + if (err) + return -EINVAL; + + slave_port = rmnet_get_port(slave_dev); + slave_port->rmnet_mode = RMNET_EPMODE_BRIDGE; + slave_port->bridge_ep = real_dev; + + port->rmnet_mode = RMNET_EPMODE_BRIDGE; + port->bridge_ep = slave_dev; + + netdev_dbg(slave_dev, "registered with rmnet as slave\n"); + return 0; +} + +int rmnet_del_bridge(struct net_device *rmnet_dev, +struct net_device *slave_dev) +{ + struct rmnet_priv *priv = netdev_priv(rmnet_dev); + struct net_device *real_dev = priv->real_dev; + struct rmnet_port *port, *slave_port; + + port = rmnet_get_port(real_dev); + port->rmnet_mode = RMNET_EPMODE_VND; + port->bridge_ep = NULL; + + netdev_upper_dev_unlink(slave_dev, rmnet_dev); + slave_port = rmnet_get_port(slave_dev); + rmnet_unregister_real_device(slave_dev, slave_po
[PATCH net-next 1/7] net: qualcomm: rmnet: Remove existing logic for bridge mode
This will be rewritten in the following patches. Signed-off-by: Subash Abhinov Kasiviswanathan --- drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h | 1 - .../net/ethernet/qualcomm/rmnet/rmnet_handlers.c | 77 +++--- 2 files changed, 9 insertions(+), 69 deletions(-) diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h index dde4e9f..0b0c5a7 100644 --- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h +++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h @@ -34,7 +34,6 @@ struct rmnet_endpoint { */ struct rmnet_port { struct net_device *dev; - struct rmnet_endpoint local_ep; struct rmnet_endpoint muxed_ep[RMNET_MAX_LOGICAL_EP]; u32 ingress_data_format; u32 egress_data_format; diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c b/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c index 540c762..b50f401 100644 --- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c +++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c @@ -44,56 +44,18 @@ static void rmnet_set_skb_proto(struct sk_buff *skb) /* Generic handler */ static rx_handler_result_t -rmnet_bridge_handler(struct sk_buff *skb, struct rmnet_endpoint *ep) +rmnet_deliver_skb(struct sk_buff *skb) { - if (!ep->egress_dev) - kfree_skb(skb); - else - rmnet_egress_handler(skb, ep); + skb_reset_transport_header(skb); + skb_reset_network_header(skb); + rmnet_vnd_rx_fixup(skb, skb->dev); + skb->pkt_type = PACKET_HOST; + skb_set_mac_header(skb, 0); + netif_receive_skb(skb); return RX_HANDLER_CONSUMED; } -static rx_handler_result_t -rmnet_deliver_skb(struct sk_buff *skb, struct rmnet_endpoint *ep) -{ - switch (ep->rmnet_mode) { - case RMNET_EPMODE_NONE: - return RX_HANDLER_PASS; - - case RMNET_EPMODE_BRIDGE: - return rmnet_bridge_handler(skb, ep); - - case RMNET_EPMODE_VND: - skb_reset_transport_header(skb); - skb_reset_network_header(skb); - rmnet_vnd_rx_fixup(skb, skb->dev); - - skb->pkt_type = PACKET_HOST; - skb_set_mac_header(skb, 
0); - netif_receive_skb(skb); - return RX_HANDLER_CONSUMED; - - default: - kfree_skb(skb); - return RX_HANDLER_CONSUMED; - } -} - -static rx_handler_result_t -rmnet_ingress_deliver_packet(struct sk_buff *skb, -struct rmnet_port *port) -{ - if (!port) { - kfree_skb(skb); - return RX_HANDLER_CONSUMED; - } - - skb->dev = port->local_ep.egress_dev; - - return rmnet_deliver_skb(skb, &port->local_ep); -} - /* MAP handler */ static rx_handler_result_t @@ -130,7 +92,7 @@ static void rmnet_set_skb_proto(struct sk_buff *skb) skb_pull(skb, sizeof(struct rmnet_map_header)); skb_trim(skb, len); rmnet_set_skb_proto(skb); - return rmnet_deliver_skb(skb, ep); + return rmnet_deliver_skb(skb); } static rx_handler_result_t @@ -204,29 +166,8 @@ rx_handler_result_t rmnet_rx_handler(struct sk_buff **pskb) dev = skb->dev; port = rmnet_get_port(dev); - if (port->ingress_data_format & RMNET_INGRESS_FORMAT_MAP) { + if (port->ingress_data_format & RMNET_INGRESS_FORMAT_MAP) rc = rmnet_map_ingress_handler(skb, port); - } else { - switch (ntohs(skb->protocol)) { - case ETH_P_MAP: - if (port->local_ep.rmnet_mode == - RMNET_EPMODE_BRIDGE) { - rc = rmnet_ingress_deliver_packet(skb, port); - } else { - kfree_skb(skb); - rc = RX_HANDLER_CONSUMED; - } - break; - - case ETH_P_IP: - case ETH_P_IPV6: - rc = rmnet_ingress_deliver_packet(skb, port); - break; - - default: - rc = RX_HANDLER_PASS; - } - } return rc; } -- 1.9.1
[PATCH net-next 3/7] net: qualcomm: rmnet: Move rmnet_mode to rmnet_port
Mode information on the real device makes it easier to route packets to rmnet device or bridged device based on the configuration. Signed-off-by: Subash Abhinov Kasiviswanathan --- drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c | 12 +--- drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h | 2 +- drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c | 3 +-- 3 files changed, 7 insertions(+), 10 deletions(-) diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c index 8403eea..85fce9c 100644 --- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c +++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c @@ -124,20 +124,17 @@ static int rmnet_register_real_device(struct net_device *real_dev) } static void rmnet_set_endpoint_config(struct net_device *dev, - u8 mux_id, u8 rmnet_mode, - struct net_device *egress_dev) + u8 mux_id, struct net_device *egress_dev) { struct rmnet_endpoint *ep; - netdev_dbg(dev, "id %d mode %d dev %s\n", - mux_id, rmnet_mode, egress_dev->name); + netdev_dbg(dev, "id %d dev %s\n", mux_id, egress_dev->name); ep = rmnet_get_endpoint(dev, mux_id); /* This config is cleared on every set, so its ok to not * clear it on a device delete. 
*/ memset(ep, 0, sizeof(struct rmnet_endpoint)); - ep->rmnet_mode = rmnet_mode; ep->egress_dev = egress_dev; ep->mux_id = mux_id; } @@ -183,9 +180,10 @@ static int rmnet_newlink(struct net *src_net, struct net_device *dev, ingress_format, egress_format); port->egress_data_format = egress_format; port->ingress_data_format = ingress_format; + port->rmnet_mode = mode; - rmnet_set_endpoint_config(real_dev, mux_id, mode, dev); - rmnet_set_endpoint_config(dev, mux_id, mode, real_dev); + rmnet_set_endpoint_config(real_dev, mux_id, dev); + rmnet_set_endpoint_config(dev, mux_id, real_dev); return 0; err2: diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h index 0b0c5a7..03d473f 100644 --- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h +++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h @@ -24,7 +24,6 @@ * Exact usage of this parameter depends on the rmnet_mode. */ struct rmnet_endpoint { - u8 rmnet_mode; u8 mux_id; struct net_device *egress_dev; }; @@ -39,6 +38,7 @@ struct rmnet_port { u32 egress_data_format; struct net_device *rmnet_devices[RMNET_MAX_LOGICAL_EP]; u8 nr_rmnet_devs; + u8 rmnet_mode; }; extern struct rtnl_link_ops rmnet_link_ops; diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c b/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c index b50f401..86e37cc 100644 --- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c +++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c @@ -205,8 +205,7 @@ void rmnet_egress_handler(struct sk_buff *skb, } } - if (ep->rmnet_mode == RMNET_EPMODE_VND) - rmnet_vnd_tx_fixup(skb, orig_dev); + rmnet_vnd_tx_fixup(skb, orig_dev); dev_queue_xmit(skb); } -- 1.9.1
[PATCH net-next 0/7] Rewrite some existing functionality
This series fixes some of the broken rmnet functionality. Bridge mode is re-written and made useable and the muxed_ep is converted to hlist. Patches 1-5 are cleanups in preparation for these changes. Patch 6 does the hlist conversion. Patch 7 has the implementation of the rmnet bridge mode. Subash Abhinov Kasiviswanathan (7): net: qualcomm: rmnet: Remove existing logic for bridge mode net: qualcomm: rmnet: Remove some unused defines net: qualcomm: rmnet: Move rmnet_mode to rmnet_port net: qualcomm: rmnet: Remove duplicate setting of rmnet private info net: qualcomm: rmnet: Remove duplicate setting of rmnet_devices net: qualcomm: rmnet: Convert the muxed endpoint to hlist net: qualcomm: rmnet: Implement bridge mode drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c | 166 - drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h | 19 +-- .../net/ethernet/qualcomm/rmnet/rmnet_handlers.c | 131 ++-- .../net/ethernet/qualcomm/rmnet/rmnet_handlers.h | 3 +- .../ethernet/qualcomm/rmnet/rmnet_map_command.c| 4 +- .../net/ethernet/qualcomm/rmnet/rmnet_private.h| 8 - drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c| 36 ++--- drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.h| 7 +- 8 files changed, 204 insertions(+), 170 deletions(-) -- 1.9.1
Re: [PATCH 0/4] RCU: introduce noref debug
On Mon, Oct 09, 2017 at 06:53:12PM +0200, Paolo Abeni wrote: > On Fri, 2017-10-06 at 09:34 -0700, Paul E. McKenney wrote: > > On Fri, Oct 06, 2017 at 05:10:09PM +0200, Paolo Abeni wrote: > > > Hi, > > > > > > On Fri, 2017-10-06 at 06:34 -0700, Paul E. McKenney wrote: > > > > On Fri, Oct 06, 2017 at 02:57:45PM +0200, Paolo Abeni wrote: > > > > > The networking subsystem is currently using some kind of long-lived > > > > > RCU-protected, references to avoid the overhead of full book-keeping. > > > > > > > > > > Such references - skb_dst() noref - are stored inside the skbs and > > > > > can be > > > > > moved across relevant slices of the network stack, with the users > > > > > being in charge of properly clearing the relevant skb - or properly > > > > > refcount > > > > > the related dst references - before the skb escapes the RCU section. > > > > > > > > > > We currently don't have any deterministic debug infrastructure to > > > > > check > > > > > the dst noref usages - and the introduction of others noref artifact > > > > > is > > > > > currently under discussion. > > > > > > > > > > This series tries to tackle the above introducing an RCU debug > > > > > infrastructure > > > > > aimed at spotting incorrect noref pointer usage, in patch one. The > > > > > infrastructure is small and must be explicitly enabled via a newly > > > > > introduced > > > > > build option. > > > > > > > > > > Patch two uses such infrastructure to track dst noref usage in the > > > > > networking > > > > > stack. > > > > > > > > > > Patch 3 and 4 are bugfixes for small buglet found running this > > > > > infrastructure > > > > > on basic scenarios. > > > > > > Thank you for the prompt reply! > > > > > > > > This patchset does not look like it handles rcu_read_lock() nesting. 
> > > > For example, given code like this: > > > > > > > > void foo(void) > > > > { > > > > rcu_read_lock(); > > > > rcu_track_noref(&key2, &noref2, true); > > > > do_something(); > > > > rcu_track_noref(&key2, &noref2, false); > > > > rcu_read_unlock(); > > > > } > > > > > > > > void bar(void) > > > > { > > > > rcu_read_lock(); > > > > rcu_track_noref(&key1, &noref1, true); > > > > do_something_more(); > > > > foo(); > > > > do_something_else(); > > > > rcu_track_noref(&key1, &noref1, false); > > > > rcu_read_unlock(); > > > > } > > > > > > > > void grill(void) > > > > { > > > > foo(); > > > > } > > > > > > > > It looks like foo()'s rcu_read_unlock() will complain about key1. > > > > You could remove foo()'s rcu_read_lock() and rcu_read_unlock(), but > > > > that will break the call from grill(). > > > > > > Actually the code should cope correctly with your example; when foo()'s > > > rcu_read_unlock() is called, 'cache' contains: > > > > > > { { &key1, &noref1, 1}, // ... > > > > > > and when the related __rcu_check_noref() is invoked preempt_count() is > > > 2 - because the check is called before decreasing the preempt counter. > > > > > > In the main loop inside __rcu_check_noref() we will hit always the > > > 'continue' statement because 'cache->store[i].nesting != nesting', so > > > no warn will be triggered. > > > > You are right, it was too early, and my example wasn't correct. How > > about this one? 
> >
> > 	void foo(void (*f)(struct s *sp), struct s **spp)
> > 	{
> > 		rcu_read_lock();
> > 		rcu_track_noref(&key2, &noref2, true);
> > 		f(spp);
> > 		rcu_track_noref(&key2, &noref2, false);
> > 		rcu_read_unlock();
> > 	}
> >
> > 	void barcb(struct s **spp)
> > 	{
> > 		*spp = &noref3;
> > 		rcu_track_noref(&key3, *spp, true);
> > 	}
> >
> > 	void bar(void)
> > 	{
> > 		struct s *sp;
> >
> > 		rcu_read_lock();
> > 		rcu_track_noref(&key1, &noref1, true);
> > 		do_something_more();
> > 		foo(barcb, &sp);
> > 		do_something_else(sp);
> > 		rcu_track_noref(&key3, sp, false);
> > 		rcu_track_noref(&key1, &noref1, false);
> > 		rcu_read_unlock();
> > 	}
> >
> > 	void grillcb(struct s **spp)
> > 	{
> > 		*spp
> > 	}
> >
> > 	void grill(void)
> > 	{
> > 		foo();
> > 	}
>
> You are right: this will generate a splat, even if the code is safe.
> The false positive can be avoided by looking for leaked references only
> in the outermost rcu unlock. I did a previous implementation performing
> such a check, but it emitted very generic splats, so I tried to be more
> strict. The latter choice allowed me to find/do 3/4.
>
> What about using save_stack_trace() in rcu_track_noref(, true) and
> reporting such stack trace when the check in the outer
ipsec: Fix dst leak in xfrm_bundle_create().
If we cannot find a suitable inner_mode value, we will leak the currently allocated 'xdst'. The fix is to make sure it is linked into the chain before erroring out.

Signed-off-by: David S. Miller

---

Steffen, I found this via visual inspection. Please double check my work before applying this :-)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index f06253969972..2746b62a8944 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -1573,6 +1573,14 @@ static struct dst_entry *xfrm_bundle_create(struct xfrm_policy *policy,
 			goto put_states;
 		}
 
+		if (!dst_prev)
+			dst0 = dst1;
+		else
+			/* Ref count is taken during xfrm_alloc_dst()
+			 * No need to do dst_clone() on dst1
+			 */
+			dst_prev->child = dst1;
+
 		if (xfrm[i]->sel.family == AF_UNSPEC) {
 			inner_mode = xfrm_ip2inner_mode(xfrm[i],
 							xfrm_af2proto(family));
@@ -1584,14 +1592,6 @@ static struct dst_entry *xfrm_bundle_create(struct xfrm_policy *policy,
 		} else
 			inner_mode = xfrm[i]->inner_mode;
 
-		if (!dst_prev)
-			dst0 = dst1;
-		else
-			/* Ref count is taken during xfrm_alloc_dst()
-			 * No need to do dst_clone() on dst1
-			 */
-			dst_prev->child = dst1;
-
 		xdst->route = dst;
 		dst_copy_metrics(dst1, dst);
Re: [PATCH net 2/2] net: call cgroup_sk_alloc() earlier in sk_clone_lock()
From: Eric Dumazet
Date: Tue, 10 Oct 2017 19:12:33 -0700

> If for some reason, the newly allocated child needs to be freed,
> we will call cgroup_put() (via sk_free_unlock_clone()) while the
> corresponding cgroup_get() was not yet done, and we will free memory
> too soon.
>
> Fixes: d979a39d7242 ("cgroup: duplicate cgroup reference when cloning
> sockets")
> Signed-off-by: Eric Dumazet

Applied and queued up for -stable.
Re: [PATCH net 1/2] Revert "net: defer call to cgroup_sk_alloc()"
From: Eric Dumazet
Date: Tue, 10 Oct 2017 19:12:32 -0700

> This reverts commit fbb1fb4ad415cb31ce944f65a5ca700aaf73a227.
>
> This was not the proper fix; let's cleanly revert it, so that the
> following patch can be carried to stable versions.
>
> sock_cgroup_ptr() callers do not expect a NULL return value.
>
> Signed-off-by: Eric Dumazet

Applied.
Re: [next-queue PATCH v5 0/5] TSN: Add qdisc based config interface for CBS
From: Vinicius Costa Gomes
Date: Tue, 10 Oct 2017 17:43:55 -0700

> Changes since v4:
> - Added a software implementation of the CBS algorithm;

This series now looks fine to me. I'll give others a chance to review it.
Re: [PATCH RFC 0/3] tun zerocopy stats
On October 11, 2017 at 03:11, Willem de Bruijn wrote:
> On Tue, Oct 10, 2017 at 1:39 PM, David Miller wrote:
>> From: Willem de Bruijn
>> Date: Tue, 10 Oct 2017 11:29:33 -0400
>>
>>> If there is a way to expose these stats through vhost_net directly,
>>> instead of through tun, that may be better. But I did not see a
>>> suitable interface. Perhaps debugfs.
>>
>> Please don't use debugfs, thank you :-)
>
> Okay. I'll take a look at tracing for on-demand measurement.

This reminds me of a past series that added tracepoints to vhost/net [1]. It can count zerocopy/datacopy independently and even contains a sample program to show the stats.

Thanks

[1] https://lists.oasis-open.org/archives/virtio-dev/201403/msg00025.html
Re: [PATCH net 0/7] net: qualcomm: rmnet: Fix some existing functionality
From: Subash Abhinov Kasiviswanathan
Date: Tue, 10 Oct 2017 18:20:22 -0600

> This series fixes some of the broken rmnet functionality from the initial
> patchset. Bridge mode is re-written and made usable, and the muxed_ep is
> converted to an hlist.
>
> Patches 1-5 are cleanups in preparation for these changes.
> Patch 6 does the hlist conversion.
> Patch 7 has the implementation of the rmnet bridge mode.
>
> Note that there will be a compilation error when merging net with net-next
> due to the addition of the ext ack argument in
> netdev_master_upper_dev_link / ndo_add_slave.

I don't think any of these changes qualify as "fixes". They don't fix any bugs at all. These are cleanups, plus the addition of a new feature used for debugging.

Therefore these should all be targeted at the 'net-next' tree, not 'net'.
Re: [Patch net-next] tcp: add a tracepoint for tcp_retransmit_skb()
On Tue, Oct 10, 2017 at 11:58:53PM +0200, Hannes Frederic Sowa wrote:
> Alexei Starovoitov writes:
>
> > On Mon, Oct 09, 2017 at 10:35:47PM -0700, Cong Wang wrote:
> > [...]
> >> +	trace_tcp_retransmit_skb(sk, skb, segs);
> >
> > I'm happy to see new tracepoints being added to the tcp stack, but I'm
> > concerned with the practical usability of them. Like the above
> > tracepoint definition makes it not very useful from the bpf point of
> > view, since the 'sk' pointer is not recorded as part of the tracepoint.
> > In the bpf/tracing world we prefer tracepoints to have raw pointers
> > recorded in TP_STRUCT__entry() and _not_ printed in TP_printk()
> > (since pointers are useless for userspace).
>
> Ack.
>
> Also could the TP_printk also use the socket cookies so they can get
> associated with netlink dumps and as such also be associated to user
> space processes? It could help against races while trying to associate
> the socket with a process. ss already supports dumping those cookies
> with -e.

makes sense to me.

> The corresponding commit would be:
>
> commit 33cf7c90fe2f97afb1cadaa0cfb782cb9d1b9ee2
> Author: Eric Dumazet
> Date:   Wed Mar 11 18:53:14 2015 -0700
>
>     net: add real socket cookies
>
> Right now they only get set when needed but as Eric already mentioned in
> his commit log, this could be refined.

actually we hit that too for a completely different tracing use case.
Indeed it would be good to generate the socket cookie unconditionally for
all sockets. I don't think there is any harm.
Re: [Patch net-next] tcp: add a tracepoint for tcp_retransmit_skb()
On Tue, Oct 10, 2017 at 02:37:11PM -0700, Cong Wang wrote:
> >
> > More concretely, if you can make this trace_tcp_retransmit_skb() record
> > the sk, skb pointers and err code at the end of __tcp_retransmit_skb(),
> > it will solve our need as well.
>
> Note, currently I only call trace_tcp_retransmit_skb() for successful
> retransmissions; since you mentioned the err code, I guess you want it
> for failures too? I am not sure if tracing unsuccessful TCP
> retransmissions is meaningful here, I guess it's needed for BPF to
> track TCP states?
>
> It doesn't harm to add it, at least we can filter out err != 0 since we
> only care about successful ones.

Right now only successful rxmit would be enough for us. Only that 'err'
is hard to get via kprobe, since it's in some random register and debug
info is generally not available. If you want to drop err for now and
call the tracepoint only on success, I think that's fine too. Need to
double check. Only the sk and skb pointers are must-haves.

Thanks!
Re: Regression in throughput between kvm guests over virtual bridge
On October 6, 2017 at 04:07, Matthew Rosato wrote:

On 09/25/2017 04:18 PM, Matthew Rosato wrote:

On 09/22/2017 12:03 AM, Jason Wang wrote:

On September 21, 2017 at 03:38, Matthew Rosato wrote:

Seems to make some progress on wakeup mitigation.

Previous patch tries to reduce the unnecessary traversal of waitqueue during rx. Attached patch goes even further, which disables rx polling during processing tx. Please try it to see if it has any difference.

Unfortunately, this patch doesn't seem to have made a difference. I tried runs with both this patch and the previous patch applied, as well as only this patch applied for comparison (numbers from vhost thread of sending VM):

    4.12     4.13    patch1   patch2   patch1+2
    2.00%   +3.69%   +2.55%   +2.81%   +2.69%   [...] __wake_up_sync_key

In each case, the regression in throughput was still present.

This probably means some other cases of the wakeups were missed. Could you please record the callers of __wake_up_sync_key()?

Hi Jason,

With your 2 previous patches applied, every call to __wake_up_sync_key (for both sender and server vhost threads) shows the following stack trace:

 vhost-11478-11520 [002] 312.927229: __wake_up_sync_key <-sock_def_readable
 vhost-11478-11520 [002] 312.927230:
 => dev_hard_start_xmit
 => sch_direct_xmit
 => __dev_queue_xmit
 => br_dev_queue_push_xmit
 => br_forward_finish
 => __br_forward
 => br_handle_frame_finish
 => br_handle_frame
 => __netif_receive_skb_core
 => netif_receive_skb_internal
 => tun_get_user
 => tun_sendmsg
 => handle_tx
 => vhost_worker
 => kthread
 => kernel_thread_starter
 => kernel_thread_starter

Ping... Jason, any other ideas or suggestions?

Sorry for the delay, recovering from a long holiday. Will go back to this soon.

Thanks
[PATCH net-next] net: hns3: make local functions static
Fixes the following sparse warnings:

drivers/net/ethernet/hisilicon/hns3/hns3pf/hns3_ethtool.c:464:5: warning:
 symbol 'hns3_change_all_ring_bd_num' was not declared. Should it be static?
drivers/net/ethernet/hisilicon/hns3/hns3pf/hns3_ethtool.c:477:5: warning:
 symbol 'hns3_set_ringparam' was not declared. Should it be static?

Signed-off-by: Wei Yongjun
---
 drivers/net/ethernet/hisilicon/hns3/hns3pf/hns3_ethtool.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hns3_ethtool.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hns3_ethtool.c
index 9b36ce0..ddbd7f3 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hns3_ethtool.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hns3_ethtool.c
@@ -461,7 +461,8 @@ static int hns3_get_rxnfc(struct net_device *netdev,
 	return 0;
 }
 
-int hns3_change_all_ring_bd_num(struct hns3_nic_priv *priv, u32 new_desc_num)
+static int hns3_change_all_ring_bd_num(struct hns3_nic_priv *priv,
+				       u32 new_desc_num)
 {
 	struct hnae3_handle *h = priv->ae_handle;
 	int i;
@@ -474,7 +475,8 @@ int hns3_change_all_ring_bd_num(struct hns3_nic_priv *priv, u32 new_desc_num)
 	return hns3_init_all_ring(priv);
 }
 
-int hns3_set_ringparam(struct net_device *ndev, struct ethtool_ringparam *param)
+static int hns3_set_ringparam(struct net_device *ndev,
+			      struct ethtool_ringparam *param)
 {
 	struct hns3_nic_priv *priv = netdev_priv(ndev);
 	struct hnae3_handle *h = priv->ae_handle;
[PATCH net-next 2/2] net sched act_vlan: VLAN action rewrite to use RCU lock/unlock and update
Using a spinlock in the VLAN action causes performance issues when the VLAN action is used on multiple cores. Rewrote the VLAN action to use RCU read locking for reads and updates instead. Signed-off-by: Manish Kurup --- include/net/tc_act/tc_vlan.h | 21 - net/sched/act_vlan.c | 73 ++-- 2 files changed, 63 insertions(+), 31 deletions(-) diff --git a/include/net/tc_act/tc_vlan.h b/include/net/tc_act/tc_vlan.h index c2090df..67fd355 100644 --- a/include/net/tc_act/tc_vlan.h +++ b/include/net/tc_act/tc_vlan.h @@ -13,12 +13,17 @@ #include #include +struct tcf_vlan_params { + struct rcu_head rcu; + int tcfv_action; + u16 tcfv_push_vid; + __be16tcfv_push_proto; + u8tcfv_push_prio; +}; + struct tcf_vlan { struct tc_actioncommon; - int tcfv_action; - u16 tcfv_push_vid; - __be16 tcfv_push_proto; - u8 tcfv_push_prio; + struct tcf_vlan_params __rcu *vlan_p; }; #define to_vlan(a) ((struct tcf_vlan *)a) @@ -33,22 +38,22 @@ static inline bool is_tcf_vlan(const struct tc_action *a) static inline u32 tcf_vlan_action(const struct tc_action *a) { - return to_vlan(a)->tcfv_action; + return to_vlan(a)->vlan_p->tcfv_action; } static inline u16 tcf_vlan_push_vid(const struct tc_action *a) { - return to_vlan(a)->tcfv_push_vid; + return to_vlan(a)->vlan_p->tcfv_push_vid; } static inline __be16 tcf_vlan_push_proto(const struct tc_action *a) { - return to_vlan(a)->tcfv_push_proto; + return to_vlan(a)->vlan_p->tcfv_push_proto; } static inline u8 tcf_vlan_push_prio(const struct tc_action *a) { - return to_vlan(a)->tcfv_push_prio; + return to_vlan(a)->vlan_p->tcfv_push_prio; } #endif /* __NET_TC_VLAN_H */ diff --git a/net/sched/act_vlan.c b/net/sched/act_vlan.c index 14c262c..9bb0236 100644 --- a/net/sched/act_vlan.c +++ b/net/sched/act_vlan.c @@ -29,31 +29,37 @@ static int tcf_vlan(struct sk_buff *skb, const struct tc_action *a, int action; int err; u16 tci; + struct tcf_vlan_params *p; tcf_lastuse_update(&v->tcf_tm); bstats_cpu_update(this_cpu_ptr(v->common.cpu_bstats), skb); - 
spin_lock(&v->tcf_lock); - action = v->tcf_action; - /* Ensure 'data' points at mac_header prior calling vlan manipulating * functions. */ if (skb_at_tc_ingress(skb)) skb_push_rcsum(skb, skb->mac_len); - switch (v->tcfv_action) { + rcu_read_lock(); + + action = READ_ONCE(v->tcf_action); + + p = rcu_dereference(v->vlan_p); + + switch (p->tcfv_action) { case TCA_VLAN_ACT_POP: err = skb_vlan_pop(skb); if (err) goto drop; break; + case TCA_VLAN_ACT_PUSH: - err = skb_vlan_push(skb, v->tcfv_push_proto, v->tcfv_push_vid | - (v->tcfv_push_prio << VLAN_PRIO_SHIFT)); + err = skb_vlan_push(skb, p->tcfv_push_proto, p->tcfv_push_vid | + (p->tcfv_push_prio << VLAN_PRIO_SHIFT)); if (err) goto drop; break; + case TCA_VLAN_ACT_MODIFY: /* No-op if no vlan tag (either hw-accel or in-payload) */ if (!skb_vlan_tagged(skb)) @@ -69,15 +75,16 @@ static int tcf_vlan(struct sk_buff *skb, const struct tc_action *a, goto drop; } /* replace the vid */ - tci = (tci & ~VLAN_VID_MASK) | v->tcfv_push_vid; + tci = (tci & ~VLAN_VID_MASK) | p->tcfv_push_vid; /* replace prio bits, if tcfv_push_prio specified */ - if (v->tcfv_push_prio) { + if (p->tcfv_push_prio) { tci &= ~VLAN_PRIO_MASK; - tci |= v->tcfv_push_prio << VLAN_PRIO_SHIFT; + tci |= p->tcfv_push_prio << VLAN_PRIO_SHIFT; } /* put updated tci as hwaccel tag */ - __vlan_hwaccel_put_tag(skb, v->tcfv_push_proto, tci); + __vlan_hwaccel_put_tag(skb, p->tcfv_push_proto, tci); break; + default: BUG(); } @@ -89,6 +96,7 @@ static int tcf_vlan(struct sk_buff *skb, const struct tc_action *a, qstats_drop_inc(this_cpu_ptr(v->common.cpu_qstats)); unlock: + rcu_read_unlock(); if (skb_at_tc_ingress(skb)) skb_pull_rcsum(skb, skb->mac_len); @@ -111,6 +119,7 @@ static int tcf_vlan_init(struct net *net, struct nlattr *nla, struct nlattr *tb[TCA_VLAN_MAX + 1]; struct tc_vlan *parm; struct tcf_vlan *v; + struct t
[PATCH v3] mac80211: aead api to reduce redundancy
Currently, the aes_ccm.c and aes_gcm.c are almost line by line copy of each other. This patch reduce code redundancy by moving the code in these two files to crypto/aead_api.c to make it a higher level aead api. The file aes_ccm.c and aes_gcm.c are removed and all the functions there are now implemented in their headers using the newly added aead api. Signed-off-by: Xiang Gao --- net/mac80211/Makefile | 3 +- net/mac80211/{aes_ccm.c => aead_api.c} | 40 ++-- net/mac80211/aead_api.h| 27 net/mac80211/aes_ccm.h | 42 + net/mac80211/aes_gcm.c | 109 - net/mac80211/aes_gcm.h | 38 +--- net/mac80211/wpa.c | 4 +- 7 files changed, 111 insertions(+), 152 deletions(-) rename net/mac80211/{aes_ccm.c => aead_api.c} (67%) create mode 100644 net/mac80211/aead_api.h delete mode 100644 net/mac80211/aes_gcm.c diff --git a/net/mac80211/Makefile b/net/mac80211/Makefile index 282912245938..80f25ff2f24b 100644 --- a/net/mac80211/Makefile +++ b/net/mac80211/Makefile @@ -6,6 +6,7 @@ mac80211-y := \ driver-ops.o \ sta_info.o \ wep.o \ + aead_api.o \ wpa.o \ scan.o offchannel.o \ ht.o agg-tx.o agg-rx.o \ @@ -15,8 +16,6 @@ mac80211-y := \ rate.o \ michael.o \ tkip.o \ - aes_ccm.o \ - aes_gcm.o \ aes_cmac.o \ aes_gmac.o \ fils_aead.o \ diff --git a/net/mac80211/aes_ccm.c b/net/mac80211/aead_api.c similarity index 67% rename from net/mac80211/aes_ccm.c rename to net/mac80211/aead_api.c index a4e0d59a40dd..cc48675ba742 100644 --- a/net/mac80211/aes_ccm.c +++ b/net/mac80211/aead_api.c @@ -1,4 +1,5 @@ /* + * Copyright 2014-2015, Qualcomm Atheros, Inc. * Copyright 2003-2004, Instant802 Networks, Inc. * Copyright 2005-2006, Devicescape Software, Inc. 
* @@ -12,30 +13,29 @@ #include #include #include +#include #include -#include -#include "key.h" -#include "aes_ccm.h" +#include "aead_api.h" -int ieee80211_aes_ccm_encrypt(struct crypto_aead *tfm, u8 *b_0, u8 *aad, - u8 *data, size_t data_len, u8 *mic, - size_t mic_len) +int aead_encrypt(struct crypto_aead *tfm, u8 *b_0, u8 *aad, size_t aad_len, +u8 *data, size_t data_len, u8 *mic) { + size_t mic_len = tfm->authsize; struct scatterlist sg[3]; struct aead_request *aead_req; int reqsize = sizeof(*aead_req) + crypto_aead_reqsize(tfm); u8 *__aad; - aead_req = kzalloc(reqsize + CCM_AAD_LEN, GFP_ATOMIC); + aead_req = kzalloc(reqsize + aad_len, GFP_ATOMIC); if (!aead_req) return -ENOMEM; __aad = (u8 *)aead_req + reqsize; - memcpy(__aad, aad, CCM_AAD_LEN); + memcpy(__aad, aad, aad_len); sg_init_table(sg, 3); - sg_set_buf(&sg[0], &__aad[2], be16_to_cpup((__be16 *)__aad)); + sg_set_buf(&sg[0], __aad, aad_len); sg_set_buf(&sg[1], data, data_len); sg_set_buf(&sg[2], mic, mic_len); @@ -49,10 +49,10 @@ int ieee80211_aes_ccm_encrypt(struct crypto_aead *tfm, u8 *b_0, u8 *aad, return 0; } -int ieee80211_aes_ccm_decrypt(struct crypto_aead *tfm, u8 *b_0, u8 *aad, - u8 *data, size_t data_len, u8 *mic, - size_t mic_len) +int aead_decrypt(struct crypto_aead *tfm, u8 *b_0, u8 *aad, size_t aad_len, +u8 *data, size_t data_len, u8 *mic) { + size_t mic_len = tfm->authsize; struct scatterlist sg[3]; struct aead_request *aead_req; int reqsize = sizeof(*aead_req) + crypto_aead_reqsize(tfm); @@ -62,15 +62,15 @@ int ieee80211_aes_ccm_decrypt(struct crypto_aead *tfm, u8 *b_0, u8 *aad, if (data_len == 0) return -EINVAL; - aead_req = kzalloc(reqsize + CCM_AAD_LEN, GFP_ATOMIC); + aead_req = kzalloc(reqsize + aad_len, GFP_ATOMIC); if (!aead_req) return -ENOMEM; __aad = (u8 *)aead_req + reqsize; - memcpy(__aad, aad, CCM_AAD_LEN); + memcpy(__aad, aad, aad_len); sg_init_table(sg, 3); - sg_set_buf(&sg[0], &__aad[2], be16_to_cpup((__be16 *)__aad)); + sg_set_buf(&sg[0], __aad, aad_len); sg_set_buf(&sg[1], 
data, data_len); sg_set_buf(&sg[2], mic, mic_len); @@ -84,14 +84,14 @@ int ieee80211_aes_ccm_decrypt(struct crypto_aead *tfm, u8 *b_0, u8 *aad, return err; } -struct crypto_aead *ieee80211_aes_key_setup_encrypt(const u8 key[], - size_t key_len, - size_t mic_len) +struct crypto_aead * +aead_key_setup_encrypt(const char *alg, const u8 key[], + size_t key_len, size_t mic_len) { struct crypto_aead *tfm; int err; - tfm = cryp
[PATCH net-next 1/2] net sched act_vlan: Change stats update to use per-core stats
The VLAN action maintains one set of stats across all cores, and uses a spinlock to synchronize updates to it from all of them. Changed this to use a per-CPU stats context instead. This change will result in better performance.

Signed-off-by: Manish Kurup
---
 net/sched/act_vlan.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/net/sched/act_vlan.c b/net/sched/act_vlan.c
index 16eb067..14c262c 100644
--- a/net/sched/act_vlan.c
+++ b/net/sched/act_vlan.c
@@ -30,9 +30,10 @@ static int tcf_vlan(struct sk_buff *skb, const struct tc_action *a,
 	int err;
 	u16 tci;
 
-	spin_lock(&v->tcf_lock);
 	tcf_lastuse_update(&v->tcf_tm);
-	bstats_update(&v->tcf_bstats, skb);
+	bstats_cpu_update(this_cpu_ptr(v->common.cpu_bstats), skb);
+
+	spin_lock(&v->tcf_lock);
 	action = v->tcf_action;
 
 	/* Ensure 'data' points at mac_header prior calling vlan manipulating
@@ -85,7 +86,8 @@ static int tcf_vlan(struct sk_buff *skb, const struct tc_action *a,
 
 drop:
 	action = TC_ACT_SHOT;
-	v->tcf_qstats.drops++;
+	qstats_drop_inc(this_cpu_ptr(v->common.cpu_qstats));
+
 unlock:
 	if (skb_at_tc_ingress(skb))
 		skb_pull_rcsum(skb, skb->mac_len);
@@ -172,7 +174,7 @@ static int tcf_vlan_init(struct net *net, struct nlattr *nla,
 
 	if (!exists) {
 		ret = tcf_idr_create(tn, parm->index, est, a,
-				     &act_vlan_ops, bind, false);
+				     &act_vlan_ops, bind, true);
 		if (ret)
 			return ret;
-- 
2.7.4
Re: [PATCH] mac80211: aead api to reduce redundancy
2017-10-09 3:09 GMT-04:00 Johannes Berg:
> On Sun, 2017-10-08 at 01:43 -0400, Xiang Gao wrote:
>>
>> By the way, I'm still struggling on how to run the unit tests. It might
>> take time for me to make them run on my machine.
>
> I can run it easily, so don't worry about it too much. Running it is of
> course much appreciated, but I don't really want to go and require that
> right now, it takes a long time to run.
>
> If you do want to set it up, I suggest the vm scripts (hostap
> repository in tests/hwsim/vm/ - you can use the kernel .config there as
> a base to compile a kernel and then just kick it off from there), but it
> can take a while to run.

Thanks for your help on this. This information is actually very helpful to me. Since the unit tests are not required, I will give working on this patch higher priority than the unit tests. I will send out patches without running unit tests for now, until I can make them run on my computer. But I'm still interested in trying to run them on my computer after I finish this patch. I will send PATCH v3 soon.

Thanks

>
>> Hmm... good question. The reason is, aes_ccm.c and aes_gcm.c were
>> almost exact copies of each other. But they have different copyright
>> information.
>>
>> The copyright of aes_ccm.c was:
>>
>> Copyright 2006, Devicescape Software, Inc.
>> Copyright 2003-2004, Instant802 Networks, Inc.
>>
>> and the copyright of aes_gcm.c was:
>>
>> Copyright 2014-2015, Qualcomm Atheros, Inc.
>>
>> I just don't know how to write the copyright for the new aead_api.c,
>> so I did not put anything there.
>
> Heh, good point. Well, I guess we can pretend it wasn't already copied
> before and just "keep" both.
>
> johannes
[PATCH net 1/2] Revert "net: defer call to cgroup_sk_alloc()"
This reverts commit fbb1fb4ad415cb31ce944f65a5ca700aaf73a227.

This was not the proper fix; let's cleanly revert it, so that the
following patch can be carried to stable versions.

sock_cgroup_ptr() callers do not expect a NULL return value.

Signed-off-by: Eric Dumazet
Cc: Johannes Weiner
Cc: Tejun Heo
---
 kernel/cgroup/cgroup.c          | 11 +++++++++++
 net/core/sock.c                 |  3 ++-
 net/ipv4/inet_connection_sock.c |  5 -----
 3 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 3380a3e49af501e457991b2823020494cf32af80..44857278eb8aa6a2bbf27b7eb12137ef42628170 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5709,6 +5709,17 @@ void cgroup_sk_alloc(struct sock_cgroup_data *skcd)
 	if (cgroup_sk_alloc_disabled)
 		return;
 
+	/* Socket clone path */
+	if (skcd->val) {
+		/*
+		 * We might be cloning a socket which is left in an empty
+		 * cgroup and the cgroup might have already been rmdir'd.
+		 * Don't use cgroup_get_live().
+		 */
+		cgroup_get(sock_cgroup_ptr(skcd));
+		return;
+	}
+
 	rcu_read_lock();
 
 	while (true) {
diff --git a/net/core/sock.c b/net/core/sock.c
index 4499e31538132ed59a16d92e6f6b923e776df84e..70c6ccbdf49f2f8a5a0f7c41c7849ea01459be50 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1680,7 +1680,6 @@ struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
 
 		/* sk->sk_memcg will be populated at accept() time */
 		newsk->sk_memcg = NULL;
-		memset(&newsk->sk_cgrp_data, 0, sizeof(newsk->sk_cgrp_data));
 
 		atomic_set(&newsk->sk_drops, 0);
 		newsk->sk_send_head = NULL;
@@ -1719,6 +1718,8 @@ struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
 		newsk->sk_incoming_cpu = raw_smp_processor_id();
 		atomic64_set(&newsk->sk_cookie, 0);
 
+		cgroup_sk_alloc(&newsk->sk_cgrp_data);
+
 		/*
 		 * Before updating sk_refcnt, we must commit prior changes to memory
 		 * (Documentation/RCU/rculist_nulls.txt for details)
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index d32c74507314cc4b91d040de8e877e4bd8204106..67aec7a106860b26c929fea1624d652c87972f04 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -26,8 +26,6 @@
 #include
 #include
 #include
-#include
-#include
 
 #ifdef INET_CSK_DEBUG
 const char inet_csk_timer_bug_msg[] = "inet_csk BUG: unknown timer value\n";
@@ -478,9 +476,6 @@ struct sock *inet_csk_accept(struct sock *sk, int flags, int *err, bool kern)
 		spin_unlock_bh(&queue->fastopenq.lock);
 	}
 	mem_cgroup_sk_alloc(newsk);
-	cgroup_sk_alloc(&newsk->sk_cgrp_data);
-	sock_update_classid(&newsk->sk_cgrp_data);
-	sock_update_netprioidx(&newsk->sk_cgrp_data);
 out:
 	release_sock(sk);
 	if (req)
-- 
2.15.0.rc0.271.g36b669edcc-goog
[PATCH net 2/2] net: call cgroup_sk_alloc() earlier in sk_clone_lock()
If for some reason, the newly allocated child needs to be freed, we will call cgroup_put() (via sk_free_unlock_clone()) while the corresponding cgroup_get() was not yet done, and we will free memory too soon.

Fixes: d979a39d7242 ("cgroup: duplicate cgroup reference when cloning sockets")
Signed-off-by: Eric Dumazet
Cc: Johannes Weiner
Cc: Tejun Heo
---
 net/core/sock.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index 70c6ccbdf49f2f8a5a0f7c41c7849ea01459be50..415f441c63b9e2ff8feb010f44ca27303c72aaa1 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1687,6 +1687,7 @@ struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
 		atomic_set(&newsk->sk_zckey, 0);
 
 		sock_reset_flag(newsk, SOCK_DONE);
+		cgroup_sk_alloc(&newsk->sk_cgrp_data);
 
 		rcu_read_lock();
 		filter = rcu_dereference(sk->sk_filter);
@@ -1718,8 +1719,6 @@ struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
 		newsk->sk_incoming_cpu = raw_smp_processor_id();
 		atomic64_set(&newsk->sk_cookie, 0);
 
-		cgroup_sk_alloc(&newsk->sk_cgrp_data);
-
 		/*
 		 * Before updating sk_refcnt, we must commit prior changes to memory
 		 * (Documentation/RCU/rculist_nulls.txt for details)
-- 
2.15.0.rc0.271.g36b669edcc-goog
RE: [net-next,RESEND] igb: add function to get maximum RSS queues
> From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org]
> On Behalf Of Zhang Shengju
> Sent: Tuesday, September 19, 2017 6:41 AM
> To: Kirsher, Jeffrey T; intel-wired-l...@lists.osuosl.org; netdev@vger.kernel.org
> Subject: [net-next,RESEND] igb: add function to get maximum RSS queues
>
> This patch adds a new function igb_get_max_rss_queues() to get the maximum
> RSS queues; this will reduce duplicate code and facilitate future
> maintenance.
>
> Signed-off-by: Zhang Shengju
> ---
>  drivers/net/ethernet/intel/igb/igb.h         |  1 +
>  drivers/net/ethernet/intel/igb/igb_ethtool.c | 32 +---
>  drivers/net/ethernet/intel/igb/igb_main.c    | 12 +--
>  3 files changed, 12 insertions(+), 33 deletions(-)

Tested-by: Aaron Brown
[next-queue PATCH v5 0/5] TSN: Add qdisc based config interface for CBS
Hi,

Changes since v4:
- Added a software implementation of the CBS algorithm;

Changes since v3:
- None, only a clean patchset without old patches;

Changes since v2:
- Squashed the patch introducing the userspace API into the patch
  implementing CBS;

Changes since v1:
- Solved the mqprio dependency;
- Fixed a mqprio bug that caused the inner qdisc to have a wrong
  dev_queue associated with it;

Changes from the RFC:
- Fixed comments from Henrik Austad;
- Simplified the qdisc, using the generic implementation of callbacks
  where possible;
- Small refactor on the driver (igb) code;

This patchset is a proposal of how the Traffic Control subsystem can be used to offload the configuration of the Credit Based Shaper (defined in the IEEE 802.1Q-2014 Section 8.6.8.2) into supported network devices.

As part of this work, we've assessed previous public discussions related to TSN enabling: patches from Henrik Austad (Cisco), the presentation from Eric Mann at Linux Plumbers 2012, patches from Gangfeng Huang (National Instruments) and the current state of the OpenAVNU project (https://github.com/AVnu/OpenAvnu/).

Overview

Time-Sensitive Networking (TSN) is a set of standards that aim to address resource availability for providing bandwidth reservation and bounded latency on Ethernet based LANs. The proposal described here aims to cover mainly what is needed to enable the following standards: 802.1Qat and 802.1Qav.

The initial target of this work is the Intel i210 NIC, but other controllers' datasheets were also taken into account, like the Renesas RZ/A1H RZ/A1M group and the Synopsys DesignWare Ethernet QoS controller.

Proposal

Feature-wise, what is covered here are the configuration interfaces for HW implementations of the Credit-Based Shaper (CBS, 802.1Qav). CBS is a per-queue shaper. Given that this feature is related to traffic shaping, and that the traffic control subsystem already provides a queueing discipline that offloads config into the device driver (i.e. mqprio), designing a new qdisc for the specific purpose of offloading the config for the CBS shaper seemed like a good fit.

For steering traffic into the correct queues, we use the socket option SO_PRIORITY and then a mechanism to map priority to traffic classes / Tx queues. The qdisc mqprio is currently used in our tests.

As for the CBS config interface, this patchset is proposing a new qdisc called 'cbs'. Its 'tc' cmd line is:

  $ tc qdisc add dev IFACE parent ID cbs locredit N hicredit M sendslope S \
       idleslope I

Note that the parameters for this qdisc are the ones defined by the 802.1Q-2014 spec, so no hardware specific functionality is exposed here.

Per-stream shaping, as defined by IEEE 802.1Q-2014 Section 34.6.1, is not yet covered by this proposal.

Testing this RFC

Attached to this cover letter are:
- calculate_cbs_params.py: A Python script to calculate the parameters
  for the CBS queueing discipline;
- tsn-talker.c: A sample C implementation of the talker side of a stream;
- tsn-listener.c: A sample C implementation of the listener side of a
  stream;

For testing the patches of this series, you may want to use the attached samples to this cover letter and use the 'mqprio' qdisc to set up the priorities to Tx queues mapping, together with the 'cbs' qdisc to configure the HW shaper of the i210 controller:

1) Set up the priorities to traffic classes to hardware queues mapping:

  $ tc qdisc replace dev ens4 handle 100: parent root mqprio num_tc 3 \
       map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0

For a more detailed explanation, see mqprio(8). In short, this command will map traffic with priority 3 to hardware queue 0, traffic with priority 2 to hardware queue 1, and the rest will be mapped to hardware queues 2 and 3.

2) Check the scheme. You want to get the inner qdiscs' IDs from the bottom up:

  $ tc -g class show dev ens4

  Ex.:
  +---(100:3) mqprio
  |    +---(100:6) mqprio
  |    +---(100:7) mqprio
  |
  +---(100:2) mqprio
  |    +---(100:5) mqprio
  |
  +---(100:1) mqprio
       +---(100:4) mqprio

  * Here '100:4' is Tx Queue #0 and '100:5' is Tx Queue #1.

3) Calculate the CBS parameters for classes A and B, i.e. BW for A is 20Mbps and for B is 10Mbps:

  $ calc_cbs_params.py -A 2 -a 1500 -B 1 -b 1500

4) Configure CBS for traffic class A (priority 3) as provided by the script:

  $ tc qdisc replace dev ens4 parent 100:4 cbs locredit -1470 \
       hicredit 30 sendslope -98 idleslope 2

5) Configure CBS for traffic class B (priority 2):

  $ tc qdisc replace dev ens4 parent 100:5 cbs \
       locredit -1485 hicredit 31 sendslope -99 idleslope 1

6) Run the listener:

  $ ./tsn-listener -d 01:AA:AA:AA:AA:AA -i ens4 -s 1500

7) Run the talker for class A (prio 3 here), compiled from samples/tsn/talker.c:

  $ ./tsn-talker -d 01:AA:AA:AA:AA:AA -i ens4 -p 3 -s 1500

* The bandwidth displayed on the listener output at this stage should be very close to the one configured for class A.
RE: [PATCH net] driver: e1000: fix race condition between e1000_down() and e1000_watchdog
> From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org] On Behalf Of Vincenzo Maffione
> Sent: Saturday, September 16, 2017 9:00 AM
> To: Kirsher, Jeffrey T
> Cc: intel-wired-...@lists.osuosl.org; netdev@vger.kernel.org; linux-ker...@vger.kernel.org; Vincenzo Maffione
> Subject: [PATCH net] driver: e1000: fix race condition between e1000_down() and e1000_watchdog
>
> This patch fixes a race condition that can result in the interface being
> up and carrier on, but with transmits disabled in the hardware.
> The bug may show up by repeatedly bringing the interface down and up
> (IFF_DOWN+IFF_UP), which allows e1000_watchdog() to interleave with
> e1000_down():
>
> CPU x                         CPU y
>
> e1000_down():
>   netif_carrier_off()
>                               e1000_watchdog():
>                                 if (carrier == off) {
>                                   netif_carrier_on();
>                                   enable_hw_transmit();
>                                 }
>   disable_hw_transmit();
>                               e1000_watchdog():
>                                 /* carrier on, do nothing */
>
> Signed-off-by: Vincenzo Maffione
> ---
> drivers/net/ethernet/intel/e1000/e1000_main.c | 11 +--
> 1 file changed, 9 insertions(+), 2 deletions(-)

Tested-by: Aaron Brown
[next-queue PATCH v5 1/5] net/sched: Check for null dev_queue on create flow
From: Jesus Sanchez-Palencia In qdisc_alloc(), the dev_queue pointer was used without any checks being performed. If qdisc_create() gets a null dev_queue pointer, it just passes it along to qdisc_alloc(), leading to a crash. That happens if a root qdisc implements select_queue() and returns a null dev_queue pointer for an "invalid handle", for example, or if the dev_queue associated with the parent qdisc is null. This patch is in preparation for the next one in this series, where select_queue() is added to mqprio and may return a null dev_queue. Signed-off-by: Jesus Sanchez-Palencia --- net/sched/sch_generic.c | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c index a0a198768aad..de2408f1ccd3 100644 --- a/net/sched/sch_generic.c +++ b/net/sched/sch_generic.c @@ -603,8 +603,14 @@ struct Qdisc *qdisc_alloc(struct netdev_queue *dev_queue, struct Qdisc *sch; unsigned int size = QDISC_ALIGN(sizeof(*sch)) + ops->priv_size; int err = -ENOBUFS; - struct net_device *dev = dev_queue->dev; + struct net_device *dev; + + if (!dev_queue) { + err = -EINVAL; + goto errout; + } + + dev = dev_queue->dev; p = kzalloc_node(size, GFP_KERNEL, netdev_queue_numa_node_read(dev_queue)); -- 2.14.2
[iproute2 net-next v2 3/3] man: Add initial manpage for tc-cbs(8)
Signed-off-by: Vinicius Costa Gomes
---
 man/man8/tc-cbs.8 | 112 ++
 1 file changed, 112 insertions(+)
 create mode 100644 man/man8/tc-cbs.8

diff --git a/man/man8/tc-cbs.8 b/man/man8/tc-cbs.8
new file mode 100644
index ..97e00c84
--- /dev/null
+++ b/man/man8/tc-cbs.8
@@ -0,0 +1,112 @@
+.TH CBS 8 "18 Sept 2017" "iproute2" "Linux"
+.SH NAME
+CBS \- Credit Based Shaper (CBS) Qdisc
+.SH SYNOPSIS
+.B tc qdisc ... dev
+dev
+.B parent
+classid
+.B [ handle
+major:
+.B ] cbs idleslope
+idleslope
+.B sendslope
+sendslope
+.B hicredit
+hicredit
+.B locredit
+locredit
+.B [ offload
+0|1
+.B ]
+
+.SH DESCRIPTION
+The CBS (Credit Based Shaper) qdisc implements the shaping algorithm
+defined by IEEE 802.1Q-2014 Section 8.6.8.2, which applies a
+well-defined rate-limiting method to the traffic.
+
+This queueing discipline is intended to be used by TSN (Time Sensitive
+Networking) applications; the CBS parameters are derived directly from
+what is described in Annex L of the IEEE 802.1Q-2014
+Specification. The algorithm and how it affects the latency are
+detailed there.
+
+CBS is meant to be installed under another qdisc that maps packet
+flows to traffic classes; one example is
+.BR mqprio(8).
+
+.SH PARAMETERS
+.TP
+idleslope
+Idleslope is the rate of credits that is accumulated (in kilobits per
+second) when there is at least one packet waiting for transmission.
+Packets are transmitted when the current value of credits is equal to
+or greater than zero. When there is no packet to be transmitted the
+amount of credits is set to zero. This is the main tunable of the CBS
+algorithm.
+.TP
+sendslope
+Sendslope is the rate of credits that is depleted (it should be a
+negative number of kilobits per second) when a transmission is
+occurring. It can be calculated as follows (IEEE 802.1Q-2014 Section
+8.6.8.2 item g):
+
+sendslope = idleslope - port_transmit_rate
+
+.TP
+hicredit
+Hicredit defines the maximum amount of credits (in bytes) that can be
+accumulated.
+Hicredit depends on the characteristics of interfering
+traffic; 'max_interference_size' is the maximum size of any burst of
+traffic that can delay the transmission of a frame that is available
+for transmission for this traffic class (IEEE 802.1Q-2014 Annex L,
+Equation L-3):
+
+hicredit = max_interference_size * (idleslope / port_transmit_rate)
+
+.TP
+locredit
+Locredit is the minimum amount of credits that can be reached. It is a
+function of the traffic flowing through this qdisc (IEEE 802.1Q-2014
+Annex L, Equation L-2):
+
+locredit = max_frame_size * (sendslope / port_transmit_rate)
+
+.TP
+offload
+When
+.B offload
+is 1,
+.BR cbs(8)
+will try to configure the network interface so the CBS algorithm runs
+in the controller. The default is 0.
+
+.SH EXAMPLES
+
+CBS is used to enforce a Quality of Service by limiting the data rate
+of a traffic class; to separate packets into traffic classes the user
+may choose
+.BR mqprio(8),
+and configure it like this:
+
+.EX
+# tc qdisc add dev eth0 handle 100: parent root mqprio num_tc 3 \\
+     map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \\
+     queues 1@0 1@1 2@2 \\
+     hw 0
+.EE
+.P
+To replace the queueing discipline currently attached to traffic
+class number 0 with CBS, issue:
+.P
+.EX
+# tc qdisc replace dev eth0 parent 100:4 cbs \\
+     locredit -1470 hicredit 30 sendslope -98 idleslope 2
+.EE
+
+These values are obtained from the following parameters: the idleslope
+is 20Mbit/s, the transmission rate is 1Gbit/s, and the maximum
+interfering frame size is 1500 bytes.
+
+.SH AUTHORS
+Vinicius Costa Gomes
--
2.14.2
[iproute2 net-next v2 1/3] update headers with CBS API [ONLY FOR TESTING]
The headers will be updated when iproute2 fetches the headers from net-next, this patch is only to ease testing. Signed-off-by: Vinicius Costa Gomes --- include/linux/pkt_sched.h | 18 ++ 1 file changed, 18 insertions(+) diff --git a/include/linux/pkt_sched.h b/include/linux/pkt_sched.h index 099bf552..41e349df 100644 --- a/include/linux/pkt_sched.h +++ b/include/linux/pkt_sched.h @@ -871,4 +871,22 @@ struct tc_pie_xstats { __u32 maxq; /* maximum queue size */ __u32 ecn_mark; /* packets marked with ecn*/ }; + +/* CBS */ +struct tc_cbs_qopt { + __u8 offload; + __s32 hicredit; + __s32 locredit; + __s32 idleslope; + __s32 sendslope; +}; + +enum { + TCA_CBS_UNSPEC, + TCA_CBS_PARMS, + __TCA_CBS_MAX, +}; + +#define TCA_CBS_MAX (__TCA_CBS_MAX - 1) + #endif -- 2.14.2
[iproute2 net-next v2 2/3] tc: Add support for the CBS qdisc
The Credit Based Shaper (CBS) queueing discipline allows bandwidth reservation with sub-millisecond precision. It is defined by the 802.1Q-2014 specification (Section 8.6.8.2 and Annex L). The syntax is: tc qdisc add dev DEV parent NODE cbs locredit hicredit sendslope idleslope (The order is not important) Signed-off-by: Vinicius Costa Gomes --- tc/Makefile | 1 + tc/q_cbs.c | 142 2 files changed, 143 insertions(+) create mode 100644 tc/q_cbs.c diff --git a/tc/Makefile b/tc/Makefile index 777de5e6..24bd3e2e 100644 --- a/tc/Makefile +++ b/tc/Makefile @@ -69,6 +69,7 @@ TCMODULES += q_hhf.o TCMODULES += q_clsact.o TCMODULES += e_bpf.o TCMODULES += f_matchall.o +TCMODULES += q_cbs.o TCSO := ifeq ($(TC_CONFIG_ATM),y) diff --git a/tc/q_cbs.c b/tc/q_cbs.c new file mode 100644 index ..e53be654 --- /dev/null +++ b/tc/q_cbs.c @@ -0,0 +1,142 @@ +/* + * q_cbs.c CBS. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version + * 2 of the License, or (at your option) any later version. + * + * Authors:Vinicius Costa Gomes + * + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "utils.h" +#include "tc_util.h" + +static void explain(void) +{ + fprintf(stderr, "Usage: 
cbs hicredit BYTES locredit BYTES sendslope BPS idleslope BPS\n"); + fprintf(stderr, " [offload 0|1]\n"); + +} + +static void explain1(const char *arg, const char *val) +{ + fprintf(stderr, "cbs: illegal value for \"%s\": \"%s\"\n", arg, val); +} + +static int cbs_parse_opt(struct qdisc_util *qu, int argc, char **argv, struct nlmsghdr *n) +{ + struct tc_cbs_qopt opt = {}; + struct rtattr *tail; + + while (argc > 0) { + if (matches(*argv, "offload") == 0) { + NEXT_ARG(); + if (opt.offload) { + fprintf(stderr, "cbs: duplicate \"offload\" specification\n"); + return -1; + } + if (get_u8(&opt.offload, *argv, 0)) { + explain1("offload", *argv); + return -1; + } + } else if (matches(*argv, "hicredit") == 0) { + NEXT_ARG(); + if (opt.hicredit) { + fprintf(stderr, "cbs: duplicate \"hicredit\" specification\n"); + return -1; + } + if (get_s32(&opt.hicredit, *argv, 0)) { + explain1("hicredit", *argv); + return -1; + } + } else if (matches(*argv, "locredit") == 0) { + NEXT_ARG(); + if (opt.locredit) { + fprintf(stderr, "cbs: duplicate \"locredit\" specification\n"); + return -1; + } + if (get_s32(&opt.locredit, *argv, 0)) { + explain1("locredit", *argv); + return -1; + } + } else if (matches(*argv, "sendslope") == 0) { + NEXT_ARG(); + if (opt.sendslope) { + fprintf(stderr, "cbs: duplicate \"sendslope\" specification\n"); + return -1; + } + if (get_s32(&opt.sendslope, *argv, 0)) { + explain1("sendslope", *argv); + return -1; + } + } else if (matches(*argv, "idleslope") == 0) { + NEXT_ARG(); + if (opt.idleslope) { + fprintf(stderr, "cbs: duplicate \"idleslope\" specification\n"); + return -1; + } + if (get_s32(&opt.idleslope, *argv, 0)) { + explain1("idleslope", *argv); + return -1; + } + } else if (strcmp(*argv, "help") == 0) { + explain(); + return -1; + } else { + fprintf(stderr, "cbs: unknown parameter \"%s\"\n", *argv); + explain(); + return -1; + } + argc--; argv++; + } + + tail = NLMSG_TAIL(n); + addattr_l(n, 1024, TCA_OPTIONS, NULL, 0); +
[next-queue PATCH v5 3/5] net/sched: Introduce Credit Based Shaper (CBS) qdisc
This queueing discipline implements the shaper algorithm defined by the 802.1Q-2014 Section 8.6.8.2 and detailed in Annex L. Its primary usage is to apply bandwidth reservation to user-defined traffic classes, which are mapped to different queues via the mqprio qdisc. Only a simple software implementation is added for now. Signed-off-by: Vinicius Costa Gomes Signed-off-by: Jesus Sanchez-Palencia --- include/linux/netdevice.h | 1 + include/net/pkt_sched.h| 9 ++ include/uapi/linux/pkt_sched.h | 18 +++ net/sched/Kconfig | 11 ++ net/sched/Makefile | 1 + net/sched/sch_cbs.c| 305 + 6 files changed, 345 insertions(+) create mode 100644 net/sched/sch_cbs.c diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 31bb3010c69b..1f6c44ef5b21 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -775,6 +775,7 @@ enum tc_setup_type { TC_SETUP_CLSFLOWER, TC_SETUP_CLSMATCHALL, TC_SETUP_CLSBPF, + TC_SETUP_CBS, }; /* These structures hold the attributes of xdp state that are being passed diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h index 259bc191ba59..7c597b050b36 100644 --- a/include/net/pkt_sched.h +++ b/include/net/pkt_sched.h @@ -146,4 +146,13 @@ static inline bool is_classid_clsact_egress(u32 classid) TC_H_MIN(classid) == TC_H_MIN(TC_H_MIN_EGRESS); } +struct tc_cbs_qopt_offload { + u8 enable; + s32 queue; + s32 hicredit; + s32 locredit; + s32 idleslope; + s32 sendslope; +}; + #endif diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h index 099bf5528fed..41e349df4bf4 100644 --- a/include/uapi/linux/pkt_sched.h +++ b/include/uapi/linux/pkt_sched.h @@ -871,4 +871,22 @@ struct tc_pie_xstats { __u32 maxq; /* maximum queue size */ __u32 ecn_mark; /* packets marked with ecn*/ }; + +/* CBS */ +struct tc_cbs_qopt { + __u8 offload; + __s32 hicredit; + __s32 locredit; + __s32 idleslope; + __s32 sendslope; +}; + +enum { + TCA_CBS_UNSPEC, + TCA_CBS_PARMS, + __TCA_CBS_MAX, +}; + +#define TCA_CBS_MAX 
(__TCA_CBS_MAX - 1) + #endif diff --git a/net/sched/Kconfig b/net/sched/Kconfig index e70ed26485a2..c03d86a7775e 100644 --- a/net/sched/Kconfig +++ b/net/sched/Kconfig @@ -172,6 +172,17 @@ config NET_SCH_TBF To compile this code as a module, choose M here: the module will be called sch_tbf. +config NET_SCH_CBS + tristate "Credit Based Shaper (CBS)" + ---help--- + Say Y here if you want to use the Credit Based Shaper (CBS) packet + scheduling algorithm. + + See the top of for more details. + + To compile this code as a module, choose M here: the + module will be called sch_cbs. + config NET_SCH_GRED tristate "Generic Random Early Detection (GRED)" ---help--- diff --git a/net/sched/Makefile b/net/sched/Makefile index 7b915d226de7..80c8f92d162d 100644 --- a/net/sched/Makefile +++ b/net/sched/Makefile @@ -52,6 +52,7 @@ obj-$(CONFIG_NET_SCH_FQ_CODEL)+= sch_fq_codel.o obj-$(CONFIG_NET_SCH_FQ) += sch_fq.o obj-$(CONFIG_NET_SCH_HHF) += sch_hhf.o obj-$(CONFIG_NET_SCH_PIE) += sch_pie.o +obj-$(CONFIG_NET_SCH_CBS) += sch_cbs.o obj-$(CONFIG_NET_CLS_U32) += cls_u32.o obj-$(CONFIG_NET_CLS_ROUTE4) += cls_route.o diff --git a/net/sched/sch_cbs.c b/net/sched/sch_cbs.c new file mode 100644 index ..5e1f72df1abd --- /dev/null +++ b/net/sched/sch_cbs.c @@ -0,0 +1,305 @@ +/* + * net/sched/sch_cbs.c Credit Based Shaper + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version + * 2 of the License, or (at your option) any later version. + * + * Authors:Vinicius Costa Gomes + * + */ + +/* Credit Based Shaper (CBS) + * = + * + * This is a simple rate-limiting shaper aimed at TSN applications on + * systems with known traffic workloads. + * + * Its algorithm is defined by the IEEE 802.1Q-2014 Specification, + * Section 8.6.8.2, and explained in more detail in the Annex L of the + * same specification. 
+ * + * There are four tunables to be considered: + * + * 'idleslope': Idleslope is the rate of credits that is + * accumulated (in kilobits per second) when there is at least + * one packet waiting for transmission. Packets are transmitted + * when the current value of credits is equal or greater than + * zero. When there is no packet to be transmitted the amount of + * credits is set to zero. This is the main tunable of the CBS + * algorithm. + * + * 'sendslope': + * Sendslope is the rate of credits that is
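The credit rules described in the comment above can be modeled in a few lines of user-space Python. This is an illustrative simulation only, not the kernel code: the patch's cbs_dequeue_soft() implements the same idea using ktime timestamps and a qdisc watchdog, and the variable names below are assumptions.

```python
def cbs_send_times(frames_bytes, idleslope, sendslope, port_rate):
    """Return the transmit-start time (in seconds) of each queued frame.

    idleslope and sendslope are in bytes per second (sendslope is
    negative), port_rate in bytes per second.
    """
    t, credits, starts = 0.0, 0.0, []
    for size in frames_bytes:
        if credits < 0:
            # A frame may only start when credits reach zero; credits
            # replenish at idleslope while the frame waits.
            t += -credits / idleslope
            credits = 0.0
        starts.append(t)
        tx_time = size / port_rate
        credits += sendslope * tx_time  # credits drain during transmission
        t += tx_time
    return starts

# Two back-to-back 125-byte frames on a 125 kB/s port with 12.5 kB/s
# reserved: the second frame is delayed ~9 ms while credits recover.
print(cbs_send_times([125, 125], 12500, 12500 - 125000, 125000))
```

The key property, matching the text above, is that credits are clamped to zero when the queue is empty, so an idle class cannot bank credit and later burst above its reservation.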
[next-queue PATCH v5 4/5] net/sched: Add support for HW offloading for CBS
This adds support for offloading the CBS algorithm to the controller, if supported. Signed-off-by: Vinicius Costa Gomes --- net/sched/sch_cbs.c | 92 ++--- 1 file changed, 81 insertions(+), 11 deletions(-) diff --git a/net/sched/sch_cbs.c b/net/sched/sch_cbs.c index 5e1f72df1abd..2812bac4092b 100644 --- a/net/sched/sch_cbs.c +++ b/net/sched/sch_cbs.c @@ -69,6 +69,7 @@ struct cbs_sched_data { bool offload; + int queue; s64 port_rate; /* in bytes/s */ s64 last; /* timestamp in ns */ s64 credits; /* in bytes */ @@ -81,6 +82,11 @@ struct cbs_sched_data { struct sk_buff *(*dequeue)(struct Qdisc *sch); }; +static int cbs_enqueue_offload(struct sk_buff *skb, struct Qdisc *sch) +{ + return qdisc_enqueue_tail(skb, sch); +} + static int cbs_enqueue_soft(struct sk_buff *skb, struct Qdisc *sch) { struct cbs_sched_data *q = qdisc_priv(sch); @@ -179,6 +185,11 @@ static struct sk_buff *cbs_dequeue_soft(struct Qdisc *sch) return skb; } +static struct sk_buff *cbs_dequeue_offload(struct Qdisc *sch) +{ + return qdisc_dequeue_head(sch); +} + static struct sk_buff *cbs_dequeue(struct Qdisc *sch) { struct cbs_sched_data *q = qdisc_priv(sch); @@ -190,14 +201,37 @@ static const struct nla_policy cbs_policy[TCA_CBS_MAX + 1] = { [TCA_CBS_PARMS] = { .len = sizeof(struct tc_cbs_qopt) }, }; +static void disable_cbs_offload(struct net_device *dev, + struct cbs_sched_data *q) +{ + struct tc_cbs_qopt_offload cbs = { }; + const struct net_device_ops *ops; + int err; + + if (!q->offload) + return; + + ops = dev->netdev_ops; + if (!ops->ndo_setup_tc) + return; + + cbs.queue = q->queue; + cbs.enable = 0; + + err = ops->ndo_setup_tc(dev, TC_SETUP_CBS, &cbs); + if (err < 0) + pr_warn("Couldn't disable CBS offload for queue %d\n", + cbs.queue); +} + static int cbs_change(struct Qdisc *sch, struct nlattr *opt) { struct cbs_sched_data *q = qdisc_priv(sch); struct net_device *dev = qdisc_dev(sch); + struct tc_cbs_qopt_offload cbs = { }; struct nlattr *tb[TCA_CBS_MAX + 1]; - struct ethtool_link_ksettings 
ecmd; + const struct net_device_ops *ops; struct tc_cbs_qopt *qopt; - s64 link_speed; int err; err = nla_parse_nested(tb, TCA_CBS_MAX, opt, cbs_policy, NULL); @@ -209,18 +243,48 @@ static int cbs_change(struct Qdisc *sch, struct nlattr *opt) qopt = nla_data(tb[TCA_CBS_PARMS]); - if (qopt->offload) - return -EOPNOTSUPP; + q->enqueue = cbs_enqueue_offload; + q->dequeue = cbs_dequeue_offload; - if (!__ethtool_get_link_ksettings(dev, &ecmd)) - link_speed = ecmd.base.speed; - else - link_speed = SPEED_1000; + if (!qopt->offload) { + struct ethtool_link_ksettings ecmd; + s64 link_speed; - q->port_rate = link_speed * 1000 * BYTES_PER_KBIT; + if (!__ethtool_get_link_ksettings(dev, &ecmd)) + link_speed = ecmd.base.speed; + else + link_speed = SPEED_1000; - q->enqueue = cbs_enqueue_soft; - q->dequeue = cbs_dequeue_soft; + q->port_rate = link_speed * 1000 * BYTES_PER_KBIT; + + q->enqueue = cbs_enqueue_soft; + q->dequeue = cbs_dequeue_soft; + + disable_cbs_offload(dev, q); + + err = 0; + goto done; + } + + cbs.queue = q->queue; + + cbs.enable = 1; + cbs.hicredit = qopt->hicredit; + cbs.locredit = qopt->locredit; + cbs.idleslope = qopt->idleslope; + cbs.sendslope = qopt->sendslope; + + ops = dev->netdev_ops; + + err = -EOPNOTSUPP; + if (!ops->ndo_setup_tc) + goto done; + + err = ops->ndo_setup_tc(dev, TC_SETUP_CBS, &cbs); + +done: + if (err < 0) + return err; q->hicredit = qopt->hicredit; q->locredit = qopt->locredit; @@ -234,10 +298,13 @@ static int cbs_change(struct Qdisc *sch, struct nlattr *opt) static int cbs_init(struct Qdisc *sch, struct nlattr *opt) { struct cbs_sched_data *q = qdisc_priv(sch); + struct net_device *dev = qdisc_dev(sch); if (!opt) return -EINVAL; + q->queue = sch->dev_queue - netdev_get_tx_queue(dev, 0); + qdisc_watchdog_init(&q->watchdog, sch); return cbs_change(sch, opt); @@ -246,8 +313,11 @@ static int cbs_init(struct Qdisc *sch, struct nlattr *opt) static void cbs_destroy(struct Qdisc *sch) { struct cbs_sched_data *q = qdisc_priv(sch); + struct 
net_device *dev = qdisc_dev(sch); qdisc_watchdog_cancel(&q->watchdog); + + disable_cbs_offload(dev,
[next-queue PATCH v5 5/5] igb: Add support for CBS offload
From: Andre Guedes

This patch adds support for Credit-Based Shaper (CBS) qdisc offload from the Traffic Control system. This support enables us to leverage the Forwarding and Queuing for Time-Sensitive Streams (FQTSS) features of the Intel i210 Ethernet Controller. FQTSS is the former 802.1Qav standard, which was merged into 802.1Q in 2014. It enables traffic prioritization and bandwidth reservation via the Credit-Based Shaper, which is implemented in hardware by the i210 controller.

The patch introduces the igb_setup_tc() function, which implements the support for CBS qdisc hardware offload in the IGB driver. CBS offload is the only traffic control offload supported by the driver at the moment.

FQTSS transmission mode on the i210 controller is automatically enabled by the IGB driver when CBS is enabled for the first hardware queue. Likewise, FQTSS mode is automatically disabled when CBS is disabled for the last hardware queue. Changing FQTSS mode requires a NIC reset.

The FQTSS feature is supported by the i210 controller only.
Signed-off-by: Andre Guedes --- drivers/net/ethernet/intel/igb/e1000_defines.h | 23 ++ drivers/net/ethernet/intel/igb/e1000_regs.h| 8 + drivers/net/ethernet/intel/igb/igb.h | 6 + drivers/net/ethernet/intel/igb/igb_main.c | 347 + 4 files changed, 384 insertions(+) diff --git a/drivers/net/ethernet/intel/igb/e1000_defines.h b/drivers/net/ethernet/intel/igb/e1000_defines.h index 1de82f247312..83cabff1e0ab 100644 --- a/drivers/net/ethernet/intel/igb/e1000_defines.h +++ b/drivers/net/ethernet/intel/igb/e1000_defines.h @@ -353,7 +353,18 @@ #define E1000_RXPBS_CFG_TS_EN 0x8000 #define I210_RXPBSIZE_DEFAULT 0x00A2 /* RXPBSIZE default */ +#define I210_RXPBSIZE_MASK 0x003F +#define I210_RXPBSIZE_PB_32KB 0x0020 #define I210_TXPBSIZE_DEFAULT 0x0414 /* TXPBSIZE default */ +#define I210_TXPBSIZE_MASK 0xC0FF +#define I210_TXPBSIZE_PB0_8KB (8 << 0) +#define I210_TXPBSIZE_PB1_8KB (8 << 6) +#define I210_TXPBSIZE_PB2_4KB (4 << 12) +#define I210_TXPBSIZE_PB3_4KB (4 << 18) + +#define I210_DTXMXPKTSZ_DEFAULT0x0098 + +#define I210_SR_QUEUES_NUM 2 /* SerDes Control */ #define E1000_SCTL_DISABLE_SERDES_LOOPBACK 0x0400 @@ -1051,4 +1062,16 @@ #define E1000_VLAPQF_P_VALID(_n) (0x1 << (3 + (_n) * 4)) #define E1000_VLAPQF_QUEUE_MASK0x03 +/* TX Qav Control fields */ +#define E1000_TQAVCTRL_XMIT_MODE BIT(0) +#define E1000_TQAVCTRL_DATAFETCHARBBIT(4) +#define E1000_TQAVCTRL_DATATRANARB BIT(8) + +/* TX Qav Credit Control fields */ +#define E1000_TQAVCC_IDLESLOPE_MASK0x +#define E1000_TQAVCC_QUEUEMODE BIT(31) + +/* Transmit Descriptor Control fields */ +#define E1000_TXDCTL_PRIORITY BIT(27) + #endif diff --git a/drivers/net/ethernet/intel/igb/e1000_regs.h b/drivers/net/ethernet/intel/igb/e1000_regs.h index 58adbf234e07..8eee081d395f 100644 --- a/drivers/net/ethernet/intel/igb/e1000_regs.h +++ b/drivers/net/ethernet/intel/igb/e1000_regs.h @@ -421,6 +421,14 @@ do { \ #define E1000_I210_FLA 0x1201C +#define E1000_I210_DTXMXPKTSZ 0x355C + +#define E1000_I210_TXDCTL(_n) (0x0E028 + ((_n) * 0x40)) + 
+#define E1000_I210_TQAVCTRL0x3570 +#define E1000_I210_TQAVCC(_n) (0x3004 + ((_n) * 0x40)) +#define E1000_I210_TQAVHC(_n) (0x300C + ((_n) * 0x40)) + #define E1000_INVM_DATA_REG(_n)(0x12120 + 4*(_n)) #define E1000_INVM_SIZE64 /* Number of INVM Data Registers */ diff --git a/drivers/net/ethernet/intel/igb/igb.h b/drivers/net/ethernet/intel/igb/igb.h index 06ffb2bc713e..92845692087a 100644 --- a/drivers/net/ethernet/intel/igb/igb.h +++ b/drivers/net/ethernet/intel/igb/igb.h @@ -281,6 +281,11 @@ struct igb_ring { u16 count; /* number of desc. in the ring */ u8 queue_index; /* logical index of the ring*/ u8 reg_idx; /* physical index of the ring */ + bool cbs_enable;/* indicates if CBS is enabled */ + s32 idleslope; /* idleSlope in kbps */ + s32 sendslope; /* sendSlope in kbps */ + s32 hicredit; /* hiCredit in bytes */ + s32 locredit; /* loCredit in bytes */ /* everything past this point are written often */ u16 next_to_clean; @@ -621,6 +626,7 @@ struct igb_adapter { #define IGB_FLAG_EEE BIT(14) #define IGB_FLAG_VLAN_PROMISC BIT(15) #define IGB_FLAG_RX_LEGACY BIT(16) +#define IGB_FLAG_FQTSS BIT(17) /* Media Auto Sense */ #define IGB_MAS_ENABLE_0 0X0001 diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c index 837d9b46a390..be2cf263efa9 100644 --- a/drivers/net/ethernet/intel/igb/igb_main.c +++ b/
[next-queue PATCH v5 2/5] mqprio: Implement select_queue class_ops
From: Jesus Sanchez-Palencia When replacing a child qdisc from mqprio, tc_modify_qdisc() must fetch the netdev_queue pointer that the current child qdisc is associated with before creating the new qdisc. Currently, when using mqprio as root qdisc, the kernel will end up getting the queue #0 pointer from the mqprio (root qdisc), which leaves any new child qdisc with a possibly wrong netdev_queue pointer. Implementing the Qdisc_class_ops select_queue() on mqprio fixes this issue and avoids an inconsistent state when child qdiscs are replaced. Signed-off-by: Jesus Sanchez-Palencia --- net/sched/sch_mqprio.c | 7 +++ 1 file changed, 7 insertions(+) diff --git a/net/sched/sch_mqprio.c b/net/sched/sch_mqprio.c index 6bcdfe6e7b63..8c042ae323e3 100644 --- a/net/sched/sch_mqprio.c +++ b/net/sched/sch_mqprio.c @@ -396,6 +396,12 @@ static void mqprio_walk(struct Qdisc *sch, struct qdisc_walker *arg) } } +static struct netdev_queue *mqprio_select_queue(struct Qdisc *sch, + struct tcmsg *tcm) +{ + return mqprio_queue_get(sch, TC_H_MIN(tcm->tcm_parent)); +} + static const struct Qdisc_class_ops mqprio_class_ops = { .graft = mqprio_graft, .leaf = mqprio_leaf, @@ -403,6 +409,7 @@ static const struct Qdisc_class_ops mqprio_class_ops = { .walk = mqprio_walk, .dump = mqprio_dump_class, .dump_stats = mqprio_dump_class_stats, + .select_queue = mqprio_select_queue, }; static struct Qdisc_ops mqprio_qdisc_ops __read_mostly = { -- 2.14.2
[jkirsher/next-queue PATCH v4 4/6] i40e: Admin queue definitions for cloud filters
Add new admin queue definitions and extended fields for cloud filter support. Define big buffer for extended general fields in Add/Remove Cloud filters command. v3: Shortened some lengthy struct names. v2: Added I40E_CHECK_STRUCT_LEN check to AQ command structs and added AQ definitions to i40evf for consistency based on Shannon's feedback. Signed-off-by: Amritha Nambiar Signed-off-by: Kiran Patil Signed-off-by: Jingjing Wu --- drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h | 110 .../net/ethernet/intel/i40evf/i40e_adminq_cmd.h| 110 2 files changed, 216 insertions(+), 4 deletions(-) diff --git a/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h b/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h index 729976b..bcc7986 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h +++ b/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h @@ -1371,14 +1371,16 @@ struct i40e_aqc_add_remove_cloud_filters { #define I40E_AQC_ADD_CLOUD_CMD_SEID_NUM_SHIFT 0 #define I40E_AQC_ADD_CLOUD_CMD_SEID_NUM_MASK (0x3FF << \ I40E_AQC_ADD_CLOUD_CMD_SEID_NUM_SHIFT) - u8 reserved2[4]; + u8 big_buffer_flag; +#define I40E_AQC_ADD_CLOUD_CMD_BB 1 + u8 reserved2[3]; __le32 addr_high; __le32 addr_low; }; I40E_CHECK_CMD_LENGTH(i40e_aqc_add_remove_cloud_filters); -struct i40e_aqc_add_remove_cloud_filters_element_data { +struct i40e_aqc_cloud_filters_element_data { u8 outer_mac[6]; u8 inner_mac[6]; __le16 inner_vlan; @@ -1408,6 +1410,13 @@ struct i40e_aqc_add_remove_cloud_filters_element_data { #define I40E_AQC_ADD_CLOUD_FILTER_IMAC 0x000A #define I40E_AQC_ADD_CLOUD_FILTER_OMAC_TEN_ID_IMAC 0x000B #define I40E_AQC_ADD_CLOUD_FILTER_IIP 0x000C +/* 0x0010 to 0x0017 is for custom filters */ +/* flag to be used when adding cloud filter: IP + L4 Port */ +#define I40E_AQC_ADD_CLOUD_FILTER_IP_PORT 0x0010 +/* flag to be used when adding cloud filter: Dest MAC + L4 Port */ +#define I40E_AQC_ADD_CLOUD_FILTER_MAC_PORT 0x0011 +/* flag to be used when adding cloud filter: Dest MAC + VLAN + L4 Port */ +#define 
I40E_AQC_ADD_CLOUD_FILTER_MAC_VLAN_PORT0x0012 #define I40E_AQC_ADD_CLOUD_FLAGS_TO_QUEUE 0x0080 #define I40E_AQC_ADD_CLOUD_VNK_SHIFT 6 @@ -1442,6 +1451,49 @@ struct i40e_aqc_add_remove_cloud_filters_element_data { u8 response_reserved[7]; }; +I40E_CHECK_STRUCT_LEN(0x40, i40e_aqc_cloud_filters_element_data); + +/* i40e_aqc_cloud_filters_element_bb is used when + * I40E_AQC_CLOUD_CMD_BB flag is set. + */ +struct i40e_aqc_cloud_filters_element_bb { + struct i40e_aqc_cloud_filters_element_data element; + u16 general_fields[32]; +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X10_WORD0 0 +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X10_WORD1 1 +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X10_WORD2 2 +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X11_WORD0 3 +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X11_WORD1 4 +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X11_WORD2 5 +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X12_WORD0 6 +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X12_WORD1 7 +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X12_WORD2 8 +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X13_WORD0 9 +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X13_WORD1 10 +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X13_WORD2 11 +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X14_WORD0 12 +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X14_WORD1 13 +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X14_WORD2 14 +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X16_WORD0 15 +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X16_WORD1 16 +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X16_WORD2 17 +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X16_WORD3 18 +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X16_WORD4 19 +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X16_WORD5 20 +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X16_WORD6 21 +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X16_WORD7 22 +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X17_WORD0 23 +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X17_WORD1 24 +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X17_WORD2 25 +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X17_WORD3 26 +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X17_WORD4 27 +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X17_WORD5 28 +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X17_WORD6 29 +#define 
I40E_AQC_ADD_CLOUD_FV_FLU_0X17_WORD7 30 +}; + +I40E_CHECK_STRUCT_LEN(0x80, i40e_aqc_cloud_filters_element_bb); + struct i40e_aqc_remove_cloud_filters_completion { __le16 perfect_ovlan_used; __le16 perfect_ovlan_free; @@ -1453,6 +1505,60 @@ struct i40e_aqc_remove_cloud_filters_completion { I40E_CHECK_CMD_LENGTH(i40e_aqc_remove_cloud_filters_completion); +/* Replace filter Command 0x025F + * uses the i40e_aqc_replace_cloud_filters, + * and the generic indirect completion structure + */ +struct i40e_filter_data { + u8 filter_type; + u8 input[3]; +}; + +I40
[jkirsher/next-queue PATCH v4 3/6] i40e: Cloud filter mode for set_switch_config command
Add definitions for L4 filters and switch modes based on cloud filters modes and extend the set switch config command to include the additional cloud filter mode. Signed-off-by: Amritha Nambiar Signed-off-by: Kiran Patil --- drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h | 30 - drivers/net/ethernet/intel/i40e/i40e_common.c |4 ++- drivers/net/ethernet/intel/i40e/i40e_ethtool.c|2 + drivers/net/ethernet/intel/i40e/i40e_main.c |2 + drivers/net/ethernet/intel/i40e/i40e_prototype.h |2 + drivers/net/ethernet/intel/i40e/i40e_type.h |9 ++ 6 files changed, 44 insertions(+), 5 deletions(-) diff --git a/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h b/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h index 6a5db1b..729976b 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h +++ b/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h @@ -790,7 +790,35 @@ struct i40e_aqc_set_switch_config { */ __le16 first_tag; __le16 second_tag; - u8 reserved[6]; + /* Next byte is split into following: +* Bit 7 : 0: No action, 1: Switch to mode defined by bits 6:0 +* Bit 6: 0 : Destination Port, 1: source port +* Bit 5..4: L4 type +* 0: rsvd +* 1: TCP +* 2: UDP +* 3: Both TCP and UDP +* Bits 3:0 Mode +* 0: default mode +* 1: L4 port only mode +* 2: non-tunneled mode +* 3: tunneled mode +*/ +#define I40E_AQ_SET_SWITCH_BIT7_VALID 0x80 + +#define I40E_AQ_SET_SWITCH_L4_SRC_PORT 0x40 + +#define I40E_AQ_SET_SWITCH_L4_TYPE_RSVD0x00 +#define I40E_AQ_SET_SWITCH_L4_TYPE_TCP 0x10 +#define I40E_AQ_SET_SWITCH_L4_TYPE_UDP 0x20 +#define I40E_AQ_SET_SWITCH_L4_TYPE_BOTH0x30 + +#define I40E_AQ_SET_SWITCH_MODE_DEFAULT0x00 +#define I40E_AQ_SET_SWITCH_MODE_L4_PORT0x01 +#define I40E_AQ_SET_SWITCH_MODE_NON_TUNNEL 0x02 +#define I40E_AQ_SET_SWITCH_MODE_TUNNEL 0x03 + u8 mode; + u8 rsvd5[5]; }; I40E_CHECK_CMD_LENGTH(i40e_aqc_set_switch_config); diff --git a/drivers/net/ethernet/intel/i40e/i40e_common.c b/drivers/net/ethernet/intel/i40e/i40e_common.c index 1b85eb3..0b3c5b7 100644 --- 
a/drivers/net/ethernet/intel/i40e/i40e_common.c +++ b/drivers/net/ethernet/intel/i40e/i40e_common.c @@ -2402,13 +2402,14 @@ i40e_status i40e_aq_get_switch_config(struct i40e_hw *hw, * @hw: pointer to the hardware structure * @flags: bit flag values to set * @valid_flags: which bit flags to set + * @mode: cloud filter mode * @cmd_details: pointer to command details structure or NULL * * Set switch configuration bits **/ enum i40e_status_code i40e_aq_set_switch_config(struct i40e_hw *hw, u16 flags, - u16 valid_flags, + u16 valid_flags, u8 mode, struct i40e_asq_cmd_details *cmd_details) { struct i40e_aq_desc desc; @@ -2420,6 +2421,7 @@ enum i40e_status_code i40e_aq_set_switch_config(struct i40e_hw *hw, i40e_aqc_opc_set_switch_config); scfg->flags = cpu_to_le16(flags); scfg->valid_flags = cpu_to_le16(valid_flags); + scfg->mode = mode; if (hw->flags & I40E_HW_FLAG_802_1AD_CAPABLE) { scfg->switch_tag = cpu_to_le16(hw->switch_tag); scfg->first_tag = cpu_to_le16(hw->first_tag); diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c index a760d75..37ca294 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c +++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c @@ -4341,7 +4341,7 @@ static int i40e_set_priv_flags(struct net_device *dev, u32 flags) sw_flags = I40E_AQ_SET_SWITCH_CFG_PROMISC; valid_flags = I40E_AQ_SET_SWITCH_CFG_PROMISC; ret = i40e_aq_set_switch_config(&pf->hw, sw_flags, valid_flags, - NULL); + 0, NULL); if (ret && pf->hw.aq.asq_last_status != I40E_AQ_RC_ESRCH) { dev_info(&pf->pdev->dev, "couldn't set switch config bits, err %s aq_err %s\n", diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c index 33a8f429..0539d43 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_main.c +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c @@ -12165,7 +12165,7 @@ static int i40e_setup_pf_switch(struct i40e_pf *pf, bool reinit) u16 valid_flags; valid_flags = 
I40E_AQ_SET_SWITCH_CFG_PROMISC; - ret = i40e_aq_set_switch_config(&pf->hw, flags, valid_flags, +
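The comment block in the patch documents the bit layout of the new `mode` byte: bit 7 validates the request, bits 5:4 select the L4 type, and bits 3:0 pick the cloud filter mode. A minimal userspace sketch of composing that byte from the patch's defines (the `build_switch_mode()` helper is hypothetical, not part of the driver):

```c
#include <assert.h>
#include <stdint.h>

/* Bit values for the 'mode' byte of i40e_aqc_set_switch_config,
 * copied from the defines introduced in the patch above. */
#define I40E_AQ_SET_SWITCH_BIT7_VALID   0x80 /* bit 7: act on bits 6:0 */
#define I40E_AQ_SET_SWITCH_L4_TYPE_TCP  0x10 /* bits 5:4: L4 type */
#define I40E_AQ_SET_SWITCH_L4_TYPE_UDP  0x20
#define I40E_AQ_SET_SWITCH_MODE_L4_PORT 0x01 /* bits 3:0: filter mode */

/* Hypothetical helper: build a mode byte that asks the firmware to
 * switch to the given cloud filter mode for the given L4 type. */
static uint8_t build_switch_mode(uint8_t l4_type, uint8_t filter_mode)
{
	/* Without bit 7 set the firmware ignores the rest of the byte. */
	return (uint8_t)(I40E_AQ_SET_SWITCH_BIT7_VALID | l4_type | filter_mode);
}
```

The resulting byte would be passed as the new `mode` argument of `i40e_aq_set_switch_config()`; callers that do not want a mode change pass 0, as the ethtool hunk above does.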
[jkirsher/next-queue PATCH v4 1/6] cls_flower: Offload classid to hardware
The classid on a filter is used to match a packet to a class. tcf_result structure contains the class ID of the class to which the packet belongs. This patch enables offloading the classid to the hardware. Signed-off-by: Amritha Nambiar --- include/net/pkt_cls.h |1 + net/sched/cls_flower.c |2 ++ 2 files changed, 3 insertions(+) diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h index 456017a..c2f847f 100644 --- a/include/net/pkt_cls.h +++ b/include/net/pkt_cls.h @@ -515,6 +515,7 @@ struct tc_cls_flower_offload { struct fl_flow_key *key; struct tcf_exts *exts; bool egress_dev; + u32 classid; }; enum tc_matchall_command { diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c index db831ac..50c8a52 100644 --- a/net/sched/cls_flower.c +++ b/net/sched/cls_flower.c @@ -241,6 +241,7 @@ static int fl_hw_replace_filter(struct tcf_proto *tp, cls_flower.mask = mask; cls_flower.key = &f->mkey; cls_flower.exts = &f->exts; + cls_flower.classid = f->res.classid; err = dev->netdev_ops->ndo_setup_tc(dev, TC_SETUP_CLSFLOWER, &cls_flower); @@ -264,6 +265,7 @@ static void fl_hw_update_stats(struct tcf_proto *tp, struct cls_fl_filter *f) cls_flower.command = TC_CLSFLOWER_STATS; cls_flower.cookie = (unsigned long) f; cls_flower.exts = &f->exts; + cls_flower.classid = f->res.classid; dev->netdev_ops->ndo_setup_tc(dev, TC_SETUP_CLSFLOWER, &cls_flower);
[jkirsher/next-queue PATCH v4 0/6] tc-flower based cloud filters in i40e
This patch series enables configuring cloud filters in i40e using the tc-flower classifier. The classification function of the filter is to match a packet to a class. cls_flower is extended to offload classid to hardware. The offloaded classid is used to direct matched packets to a traffic class on the device. The approach here is similar to the tc 'prio' qdisc which uses the classid for band selection. The ingress qdisc is called :0, so traffic classes are :1 to :8 (i40e has max of 8 TCs). TC0 is minor number 1, TC1 is minor number 2 etc. The cloud filters are added for a VSI and are cleaned up when the VSI is deleted. The filters that match on L4 ports need enhanced admin queue functions with big buffer support for extended fields in cloud filter commands. Example: # tc qdisc add dev eth0 ingress # ethtool -K eth0 hw-tc-offload on Match Dst IPv4,Dst Port and route to TC1: # tc filter add dev eth0 protocol ip parent : prio 1 flower\ dst_ip 192.168.1.1/32 ip_proto udp dst_port 22\ skip_sw classid :2 # tc filter show dev eth0 parent : filter pref 1 flower chain 0 filter pref 1 flower chain 0 handle 0x1 classid :2 eth_type ipv4 ip_proto udp dst_ip 192.168.1.1 dst_port 22 skip_sw in_hw v4: classid based approach to set traffic class for matched packets. 
Authors: Amritha Nambiar Kiran Patil Anjali Singhai Jain Jingjing Wu --- Amritha Nambiar (6): cls_flower: Offload classid to hardware i40e: Map TCs with the VSI seids i40e: Cloud filter mode for set_switch_config command i40e: Admin queue definitions for cloud filters i40e: Clean up of cloud filters i40e: Enable cloud filters via tc-flower drivers/net/ethernet/intel/i40e/i40e.h | 55 + drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h | 143 +++ drivers/net/ethernet/intel/i40e/i40e_common.c | 193 drivers/net/ethernet/intel/i40e/i40e_ethtool.c |2 drivers/net/ethernet/intel/i40e/i40e_main.c| 941 +++- drivers/net/ethernet/intel/i40e/i40e_prototype.h | 18 drivers/net/ethernet/intel/i40e/i40e_type.h| 10 .../net/ethernet/intel/i40evf/i40e_adminq_cmd.h| 113 ++ include/net/pkt_cls.h |1 net/sched/cls_flower.c |2 10 files changed, 1439 insertions(+), 39 deletions(-) --
[jkirsher/next-queue PATCH v4 5/6] i40e: Clean up of cloud filters
Introduce the cloud filter datastructure and cleanup of cloud filters associated with the device. v2: Moved field comments in struct i40e_cloud_filter to the right. Removed hlist_empty check from i40e_cloud_filter_exit() Signed-off-by: Amritha Nambiar --- drivers/net/ethernet/intel/i40e/i40e.h |9 + drivers/net/ethernet/intel/i40e/i40e_main.c | 24 2 files changed, 33 insertions(+) diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h index f3c501e..b938bb4a 100644 --- a/drivers/net/ethernet/intel/i40e/i40e.h +++ b/drivers/net/ethernet/intel/i40e/i40e.h @@ -253,6 +253,12 @@ struct i40e_fdir_filter { u32 fd_id; }; +struct i40e_cloud_filter { + struct hlist_node cloud_node; + unsigned long cookie; + u16 seid; /* filter control */ +}; + #define I40E_ETH_P_LLDP0x88cc #define I40E_DCB_PRIO_TYPE_STRICT 0 @@ -420,6 +426,9 @@ struct i40e_pf { struct i40e_udp_port_config udp_ports[I40E_MAX_PF_UDP_OFFLOAD_PORTS]; u16 pending_udp_bitmap; + struct hlist_head cloud_filter_list; + u16 num_cloud_filters; + enum i40e_interrupt_policy int_policy; u16 rx_itr_default; u16 tx_itr_default; diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c index 0539d43..bcdb16a 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_main.c +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c @@ -6937,6 +6937,26 @@ static void i40e_fdir_filter_exit(struct i40e_pf *pf) } /** + * i40e_cloud_filter_exit - Cleans up the Cloud Filters + * @pf: Pointer to PF + * + * This function destroys the hlist where all the Cloud Filters + * filters were saved. 
+ **/ +static void i40e_cloud_filter_exit(struct i40e_pf *pf) +{ + struct i40e_cloud_filter *cfilter; + struct hlist_node *node; + + hlist_for_each_entry_safe(cfilter, node, + &pf->cloud_filter_list, cloud_node) { + hlist_del(&cfilter->cloud_node); + kfree(cfilter); + } + pf->num_cloud_filters = 0; +} + +/** * i40e_close - Disables a network interface * @netdev: network interface device structure * @@ -12195,6 +12215,7 @@ static int i40e_setup_pf_switch(struct i40e_pf *pf, bool reinit) vsi = i40e_vsi_reinit_setup(pf->vsi[pf->lan_vsi]); if (!vsi) { dev_info(&pf->pdev->dev, "setup of MAIN VSI failed\n"); + i40e_cloud_filter_exit(pf); i40e_fdir_teardown(pf); return -EAGAIN; } @@ -13029,6 +13050,8 @@ static void i40e_remove(struct pci_dev *pdev) if (pf->vsi[pf->lan_vsi]) i40e_vsi_release(pf->vsi[pf->lan_vsi]); + i40e_cloud_filter_exit(pf); + /* remove attached clients */ if (pf->flags & I40E_FLAG_IWARP_ENABLED) { ret_code = i40e_lan_del_device(pf); @@ -13260,6 +13283,7 @@ static void i40e_shutdown(struct pci_dev *pdev) del_timer_sync(&pf->service_timer); cancel_work_sync(&pf->service_task); + i40e_cloud_filter_exit(pf); i40e_fdir_teardown(pf); /* Client close must be called explicitly here because the timer
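`i40e_cloud_filter_exit()` uses `hlist_for_each_entry_safe()` so each node can be unlinked and freed without invalidating the iterator. A self-contained userspace sketch of the same delete-while-walking pattern, with simplified stand-in types rather than the driver's real structures:

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified stand-ins for struct i40e_cloud_filter / struct i40e_pf. */
struct cloud_filter {
	struct cloud_filter *next;
	unsigned long cookie;
};

struct pf_state {
	struct cloud_filter *cloud_filter_list;
	unsigned int num_cloud_filters;
};

static void cloud_filter_add(struct pf_state *pf, unsigned long cookie)
{
	struct cloud_filter *f = malloc(sizeof(*f));

	if (!f)
		return;
	f->cookie = cookie;
	f->next = pf->cloud_filter_list;
	pf->cloud_filter_list = f;
	pf->num_cloud_filters++;
}

static void cloud_filter_exit(struct pf_state *pf)
{
	struct cloud_filter *cfilter = pf->cloud_filter_list;

	while (cfilter) {
		/* Save the successor before freeing, like the 'node'
		 * cursor in hlist_for_each_entry_safe(). */
		struct cloud_filter *node = cfilter->next;

		free(cfilter);
		cfilter = node;
	}
	pf->cloud_filter_list = NULL;
	pf->num_cloud_filters = 0;
}

/* Build a two-entry list, tear it down, report 0 on a clean state. */
static int demo_cloud_filter_cleanup(void)
{
	struct pf_state pf = { NULL, 0 };

	cloud_filter_add(&pf, 1);
	cloud_filter_add(&pf, 2);
	cloud_filter_exit(&pf);
	return (pf.cloud_filter_list == NULL) ? (int)pf.num_cloud_filters : -1;
}
```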
[jkirsher/next-queue PATCH v4 6/6] i40e: Enable cloud filters via tc-flower
This patch enables tc-flower based hardware offloads. tc flower filter provided by the kernel is configured as driver specific cloud filter. The patch implements functions and admin queue commands needed to support cloud filters in the driver and adds cloud filters to configure these tc-flower filters. The classification function of the filter is to direct matched packets to a traffic class which is set based on the offloaded tc-flower classid. The approach here is similar to the tc 'prio' qdisc which uses the classid for band selection. The ingress qdisc is called :0, so traffic classes are :1 to :8 (i40e has max of 8 TCs). TC0 is minor number 1, TC1 is minor number 2 etc. # tc qdisc add dev eth0 ingress # ethtool -K eth0 hw-tc-offload on Match Dst MAC and route to TC0: # tc filter add dev eth0 protocol ip parent :\ prio 1 flower dst_mac 3c:fd:fe:a0:d6:70 skip_sw\ classid :1 Match Dst IPv4,Dst Port and route to TC1: # tc filter add dev eth0 protocol ip parent :\ prio 2 flower dst_ip 192.168.3.5/32\ ip_proto udp dst_port 25 skip_sw\ classid :2 Match Dst IPv6,Dst Port and route to TC1: # tc filter add dev eth0 protocol ipv6 parent :\ prio 3 flower dst_ip fe8::200:1\ ip_proto udp dst_port 66 skip_sw\ classid :2 Delete tc flower filter: Example: # tc filter del dev eth0 parent : prio 3 handle 0x1 flower # tc filter del dev eth0 parent : Flow Director Sideband is disabled while configuring cloud filters via tc-flower and until any cloud filter exists. Unsupported matches when cloud filters are added using enhanced big buffer cloud filter mode of underlying switch include: 1. source port and source IP 2. Combined MAC address and IP fields. 3. Not specifying L4 port These filter matches can however be used to redirect traffic to the main VSI (tc 0) which does not require the enhanced big buffer cloud filter support. v4: Use classid to set traffic class for matched packets. Do not allow disabling hw-tc-offloads when offloaded tc filters are active. 
v3: Cleaned up some lengthy function names. Changed ipv6 address to __be32 array instead of u8 array. Used macro for IP version. Minor formatting changes. v2: 1. Moved I40E_SWITCH_MODE_MASK definition to i40e_type.h 2. Moved dev_info for add/deleting cloud filters in else condition 3. Fixed some format specifier in dev_err logs 4. Refactored i40e_get_capabilities to take an additional list_type parameter and use it to query device and function level capabilities. 5. Fixed parsing tc redirect action to check for the is_tcf_mirred_tc() to verify if redirect to a traffic class is supported. 6. Added comments for Geneve fix in cloud filter big buffer AQ function definitions. 7. Cleaned up setup_tc interface to rebase and work with Jiri's updates, separate function to process tc cls flower offloads. 8. Changes to make Flow Director Sideband and Cloud filters mutually exclusive. Signed-off-by: Amritha Nambiar Signed-off-by: Kiran Patil Signed-off-by: Anjali Singhai Jain Signed-off-by: Jingjing Wu --- drivers/net/ethernet/intel/i40e/i40e.h | 45 + drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h |3 drivers/net/ethernet/intel/i40e/i40e_common.c | 189 drivers/net/ethernet/intel/i40e/i40e_main.c| 913 +++- drivers/net/ethernet/intel/i40e/i40e_prototype.h | 16 drivers/net/ethernet/intel/i40e/i40e_type.h|1 .../net/ethernet/intel/i40evf/i40e_adminq_cmd.h|3 7 files changed, 1140 insertions(+), 30 deletions(-) diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h index b938bb4a..c3f1312 100644 --- a/drivers/net/ethernet/intel/i40e/i40e.h +++ b/drivers/net/ethernet/intel/i40e/i40e.h @@ -55,6 +55,8 @@ #include #include #include +#include +#include #include "i40e_type.h" #include "i40e_prototype.h" #include "i40e_client.h" @@ -253,9 +255,48 @@ struct i40e_fdir_filter { u32 fd_id; }; +#define IPV4_VERSION 4 +#define IPV6_VERSION 6 + +#define I40E_CLOUD_FIELD_OMAC 0x01 +#define I40E_CLOUD_FIELD_IMAC 0x02 +#define I40E_CLOUD_FIELD_IVLAN 0x04 
+#define I40E_CLOUD_FIELD_TEN_ID0x08 +#define I40E_CLOUD_FIELD_IIP 0x10 + +#define I40E_CLOUD_FILTER_FLAGS_OMAC I40E_CLOUD_FIELD_OMAC +#define I40E_CLOUD_FILTER_FLAGS_IMAC I40E_CLOUD_FIELD_IMAC +#define I40E_CLOUD_FILTER_FLAGS_IMAC_IVLAN (I40E_CLOUD_FIELD_IMAC | \ +I40E_CLOUD_FIELD_IVLAN) +#define I40E_CLOUD_FILTER_FLAGS_IMAC_TEN_ID(I40E_CLOUD_FIELD_IMAC | \ +I40E_CLOUD_FIELD_TEN_ID) +#define I40E_CLOUD_FILTER_FLAGS_OMAC_TEN_ID_IMAC (I40E_CLOUD_FIELD_OMAC | \ + I40E_CLOUD_FIELD_IMAC | \ + I40E_CLOUD_FIELD_TEN_ID) +#define I40E_CLOUD_FILTER_FLAGS_IMAC_IVLAN_TEN_ID (I40E_CLOUD_FIELD_IMAC | \ +
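The filter-flag macros above are compositions of single-bit field identifiers, so a driver can test which fields a given filter variant matches with a plain bitwise AND. A small sketch of that composition (values copied from the patch; `filter_matches_field()` is a hypothetical helper):

```c
#include <assert.h>

/* Single-bit field identifiers from the patch. */
#define I40E_CLOUD_FIELD_OMAC   0x01
#define I40E_CLOUD_FIELD_IMAC   0x02
#define I40E_CLOUD_FIELD_IVLAN  0x04
#define I40E_CLOUD_FIELD_TEN_ID 0x08
#define I40E_CLOUD_FIELD_IIP    0x10

/* Composite filter variants are just OR-ed field bits. */
#define I40E_CLOUD_FILTER_FLAGS_IMAC_IVLAN \
	(I40E_CLOUD_FIELD_IMAC | I40E_CLOUD_FIELD_IVLAN)
#define I40E_CLOUD_FILTER_FLAGS_OMAC_TEN_ID_IMAC \
	(I40E_CLOUD_FIELD_OMAC | I40E_CLOUD_FIELD_IMAC | \
	 I40E_CLOUD_FIELD_TEN_ID)

/* Hypothetical helper: does this filter variant match on 'field'? */
static int filter_matches_field(unsigned int flags, unsigned int field)
{
	return (flags & field) != 0;
}
```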
[jkirsher/next-queue PATCH v4 2/6] i40e: Map TCs with the VSI seids
Add mapping of TCs with the seids of the channel VSIs. TC0 will be mapped to the main VSI seid and all other TCs are mapped to the seid of the corresponding channel VSI. Signed-off-by: Amritha Nambiar --- drivers/net/ethernet/intel/i40e/i40e.h |1 + drivers/net/ethernet/intel/i40e/i40e_main.c |2 ++ 2 files changed, 3 insertions(+) diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h index eb01776..f3c501e 100644 --- a/drivers/net/ethernet/intel/i40e/i40e.h +++ b/drivers/net/ethernet/intel/i40e/i40e.h @@ -739,6 +739,7 @@ struct i40e_vsi { u16 next_base_queue;/* next queue to be used for channel setup */ struct list_head ch_list; + u16 tc_seid_map[I40E_MAX_TRAFFIC_CLASS]; void *priv; /* client driver data reference. */ diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c index 75f944f..33a8f429 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_main.c +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c @@ -6100,6 +6100,7 @@ static int i40e_configure_queue_channels(struct i40e_vsi *vsi) int ret = 0, i; /* Create app vsi with the TCs. Main VSI with TC0 is already set up */ + vsi->tc_seid_map[0] = vsi->seid; for (i = 1; i < I40E_MAX_TRAFFIC_CLASS; i++) { if (vsi->tc_config.enabled_tc & BIT(i)) { ch = kzalloc(sizeof(*ch), GFP_KERNEL); @@ -6130,6 +6131,7 @@ static int i40e_configure_queue_channels(struct i40e_vsi *vsi) i, ch->num_queue_pairs); goto err_free; } + vsi->tc_seid_map[i] = ch->seid; } } return ret;
[PATCH net 3/7] net: qualcomm: rmnet: Move rmnet_mode to rmnet_port
Mode information on the real device makes it easier to route packets to rmnet device or bridged device based on the configuration. Fixes: ceed73a2cf4a ("drivers: net: ethernet: qualcomm: rmnet: Initial implementation") Signed-off-by: Subash Abhinov Kasiviswanathan --- drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c | 12 +--- drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h | 2 +- drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c | 3 +-- 3 files changed, 7 insertions(+), 10 deletions(-) diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c index 1e33aea..63f6c9c 100644 --- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c +++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c @@ -124,20 +124,17 @@ static int rmnet_register_real_device(struct net_device *real_dev) } static void rmnet_set_endpoint_config(struct net_device *dev, - u8 mux_id, u8 rmnet_mode, - struct net_device *egress_dev) + u8 mux_id, struct net_device *egress_dev) { struct rmnet_endpoint *ep; - netdev_dbg(dev, "id %d mode %d dev %s\n", - mux_id, rmnet_mode, egress_dev->name); + netdev_dbg(dev, "id %d dev %s\n", mux_id, egress_dev->name); ep = rmnet_get_endpoint(dev, mux_id); /* This config is cleared on every set, so its ok to not * clear it on a device delete. 
*/ memset(ep, 0, sizeof(struct rmnet_endpoint)); - ep->rmnet_mode = rmnet_mode; ep->egress_dev = egress_dev; ep->mux_id = mux_id; } @@ -183,9 +180,10 @@ static int rmnet_newlink(struct net *src_net, struct net_device *dev, ingress_format, egress_format); port->egress_data_format = egress_format; port->ingress_data_format = ingress_format; + port->rmnet_mode = mode; - rmnet_set_endpoint_config(real_dev, mux_id, mode, dev); - rmnet_set_endpoint_config(dev, mux_id, mode, real_dev); + rmnet_set_endpoint_config(real_dev, mux_id, dev); + rmnet_set_endpoint_config(dev, mux_id, real_dev); return 0; err2: diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h index 0b0c5a7..03d473f 100644 --- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h +++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h @@ -24,7 +24,6 @@ * Exact usage of this parameter depends on the rmnet_mode. */ struct rmnet_endpoint { - u8 rmnet_mode; u8 mux_id; struct net_device *egress_dev; }; @@ -39,6 +38,7 @@ struct rmnet_port { u32 egress_data_format; struct net_device *rmnet_devices[RMNET_MAX_LOGICAL_EP]; u8 nr_rmnet_devs; + u8 rmnet_mode; }; extern struct rtnl_link_ops rmnet_link_ops; diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c b/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c index b50f401..86e37cc 100644 --- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c +++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c @@ -205,8 +205,7 @@ void rmnet_egress_handler(struct sk_buff *skb, } } - if (ep->rmnet_mode == RMNET_EPMODE_VND) - rmnet_vnd_tx_fixup(skb, orig_dev); + rmnet_vnd_tx_fixup(skb, orig_dev); dev_queue_xmit(skb); } -- 1.9.1
[PATCH net 4/7] net: qualcomm: rmnet: Remove duplicate setting of rmnet private info
The end point is set twice in the local_ep as well as the mux_id and the real_dev in the rmnet private structure. Remove the local_ep. While these elements are equivalent, rmnet_endpoint will be used only as part of the rmnet_port for muxed scenarios in VND mode. Fixes: ceed73a2cf4a ("drivers: net: ethernet: qualcomm: rmnet: Initial implementation") Signed-off-by: Subash Abhinov Kasiviswanathan --- drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c | 10 ++ drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h | 4 drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c | 18 ++ drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.h | 3 +-- drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c | 19 ++- drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.h | 1 - 6 files changed, 15 insertions(+), 40 deletions(-) diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c index 63f6c9c..208adf8 100644 --- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c +++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c @@ -67,13 +67,8 @@ static int rmnet_is_real_dev_registered(const struct net_device *real_dev) struct rmnet_endpoint *ep; struct rmnet_port *port; - if (!rmnet_is_real_dev_registered(dev)) { - ep = rmnet_vnd_get_endpoint(dev); - } else { - port = rmnet_get_port_rtnl(dev); - - ep = &port->muxed_ep[config_id]; - } + port = rmnet_get_port_rtnl(dev); + ep = &port->muxed_ep[config_id]; return ep; } @@ -183,7 +178,6 @@ static int rmnet_newlink(struct net *src_net, struct net_device *dev, port->rmnet_mode = mode; rmnet_set_endpoint_config(real_dev, mux_id, dev); - rmnet_set_endpoint_config(dev, mux_id, real_dev); return 0; err2: diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h index 03d473f..c5f5c6d 100644 --- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h +++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h @@ -20,9 +20,6 @@ #define RMNET_MAX_LOGICAL_EP 
255 -/* Information about the next device to deliver the packet to. - * Exact usage of this parameter depends on the rmnet_mode. - */ struct rmnet_endpoint { u8 mux_id; struct net_device *egress_dev; @@ -44,7 +41,6 @@ struct rmnet_port { extern struct rtnl_link_ops rmnet_link_ops; struct rmnet_priv { - struct rmnet_endpoint local_ep; u8 mux_id; struct net_device *real_dev; }; diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c b/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c index 86e37cc..e0802d3 100644 --- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c +++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c @@ -116,8 +116,7 @@ static void rmnet_set_skb_proto(struct sk_buff *skb) } static int rmnet_map_egress_handler(struct sk_buff *skb, - struct rmnet_port *port, - struct rmnet_endpoint *ep, + struct rmnet_port *port, u8 mux_id, struct net_device *orig_dev) { int required_headroom, additional_header_len; @@ -136,10 +135,10 @@ static int rmnet_map_egress_handler(struct sk_buff *skb, return RMNET_MAP_CONSUMED; if (port->egress_data_format & RMNET_EGRESS_FORMAT_MUXING) { - if (ep->mux_id == 0xff) + if (mux_id == 0xff) map_header->mux_id = 0; else - map_header->mux_id = ep->mux_id; + map_header->mux_id = mux_id; } skb->protocol = htons(ETH_P_MAP); @@ -176,14 +175,17 @@ rx_handler_result_t rmnet_rx_handler(struct sk_buff **pskb) * for egress device configured in logical endpoint. Packet is then transmitted * on the egress device. 
*/ -void rmnet_egress_handler(struct sk_buff *skb, - struct rmnet_endpoint *ep) +void rmnet_egress_handler(struct sk_buff *skb) { struct net_device *orig_dev; struct rmnet_port *port; + struct rmnet_priv *priv; + u8 mux_id; orig_dev = skb->dev; - skb->dev = ep->egress_dev; + priv = netdev_priv(orig_dev); + skb->dev = priv->real_dev; + mux_id = priv->mux_id; port = rmnet_get_port(skb->dev); if (!port) { @@ -192,7 +194,7 @@ void rmnet_egress_handler(struct sk_buff *skb, } if (port->egress_data_format & RMNET_EGRESS_FORMAT_MAP) { - switch (rmnet_map_egress_handler(skb, port, ep, orig_dev)) { + switch (rmnet_map_egress_handler(skb, port, mux_id, orig_dev)) { case RMNET_MAP_CONSUMED: return; diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmne
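The egress path above now takes the mux id from the rmnet device's private data and writes it into the MAP header, with the reserved value 0xff ("untagged") encoded as 0 on the wire. That translation, as a tiny self-contained sketch:

```c
#include <assert.h>
#include <stdint.h>

/* Mirror of the mux id handling in rmnet_map_egress_handler():
 * 0xff means untagged and is written as 0, anything else is
 * carried through unchanged. */
static uint8_t map_header_mux_id(uint8_t mux_id)
{
	return (mux_id == 0xff) ? 0 : mux_id;
}
```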
[PATCH net 6/7] net: qualcomm: rmnet: Convert the muxed endpoint to hlist
Rather than using a static array, use a hlist to store the muxed endpoints and use the mux id to query the rmnet_device. This is useful as usually very few mux ids are used. Fixes: ceed73a2cf4a ("drivers: net: ethernet: qualcomm: rmnet: Initial implementation") Signed-off-by: Subash Abhinov Kasiviswanathan Cc: Dan Williams --- drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c | 75 -- drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h | 4 +- .../net/ethernet/qualcomm/rmnet/rmnet_handlers.c | 17 +++-- .../ethernet/qualcomm/rmnet/rmnet_map_command.c| 4 +- drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c| 15 +++-- drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.h| 6 +- 6 files changed, 68 insertions(+), 53 deletions(-) diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c index 208adf8..1c93b65 100644 --- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c +++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c @@ -61,18 +61,6 @@ static int rmnet_is_real_dev_registered(const struct net_device *real_dev) return rtnl_dereference(real_dev->rx_handler_data); } -static struct rmnet_endpoint* -rmnet_get_endpoint(struct net_device *dev, int config_id) -{ - struct rmnet_endpoint *ep; - struct rmnet_port *port; - - port = rmnet_get_port_rtnl(dev); - ep = &port->muxed_ep[config_id]; - - return ep; -} - static int rmnet_unregister_real_device(struct net_device *real_dev, struct rmnet_port *port) { @@ -93,7 +81,7 @@ static int rmnet_unregister_real_device(struct net_device *real_dev, static int rmnet_register_real_device(struct net_device *real_dev) { struct rmnet_port *port; - int rc; + int rc, entry; ASSERT_RTNL(); @@ -114,26 +102,13 @@ static int rmnet_register_real_device(struct net_device *real_dev) /* hold on to real dev for MAP data */ dev_hold(real_dev); + for (entry = 0; entry < RMNET_MAX_LOGICAL_EP; entry++) + INIT_HLIST_HEAD(&port->muxed_ep[entry]); + netdev_dbg(real_dev, "registered with rmnet\n"); return 
0; } -static void rmnet_set_endpoint_config(struct net_device *dev, - u8 mux_id, struct net_device *egress_dev) -{ - struct rmnet_endpoint *ep; - - netdev_dbg(dev, "id %d dev %s\n", mux_id, egress_dev->name); - - ep = rmnet_get_endpoint(dev, mux_id); - /* This config is cleared on every set, so its ok to not -* clear it on a device delete. -*/ - memset(ep, 0, sizeof(struct rmnet_endpoint)); - ep->egress_dev = egress_dev; - ep->mux_id = mux_id; -} - static int rmnet_newlink(struct net *src_net, struct net_device *dev, struct nlattr *tb[], struct nlattr *data[], struct netlink_ext_ack *extack) @@ -145,6 +120,7 @@ static int rmnet_newlink(struct net *src_net, struct net_device *dev, RMNET_EGRESS_FORMAT_MAP; struct net_device *real_dev; int mode = RMNET_EPMODE_VND; + struct rmnet_endpoint *ep; struct rmnet_port *port; int err = 0; u16 mux_id; @@ -156,6 +132,10 @@ static int rmnet_newlink(struct net *src_net, struct net_device *dev, if (!data[IFLA_VLAN_ID]) return -EINVAL; + ep = kzalloc(sizeof(*ep), GFP_ATOMIC); + if (!ep) + return -ENOMEM; + mux_id = nla_get_u16(data[IFLA_VLAN_ID]); err = rmnet_register_real_device(real_dev); @@ -163,7 +143,7 @@ static int rmnet_newlink(struct net *src_net, struct net_device *dev, goto err0; port = rmnet_get_port_rtnl(real_dev); - err = rmnet_vnd_newlink(mux_id, dev, port, real_dev); + err = rmnet_vnd_newlink(mux_id, dev, port, real_dev, ep); if (err) goto err1; @@ -177,11 +157,11 @@ static int rmnet_newlink(struct net *src_net, struct net_device *dev, port->ingress_data_format = ingress_format; port->rmnet_mode = mode; - rmnet_set_endpoint_config(real_dev, mux_id, dev); + hlist_add_head_rcu(&ep->hlnode, &port->muxed_ep[mux_id]); return 0; err2: - rmnet_vnd_dellink(mux_id, port); + rmnet_vnd_dellink(mux_id, port, ep); err1: rmnet_unregister_real_device(real_dev, port); err0: @@ -191,6 +171,7 @@ static int rmnet_newlink(struct net *src_net, struct net_device *dev, static void rmnet_dellink(struct net_device *dev, struct list_head 
*head) { struct net_device *real_dev; + struct rmnet_endpoint *ep; struct rmnet_port *port; u8 mux_id; @@ -204,8 +185,15 @@ static void rmnet_dellink(struct net_device *dev, struct list_head *head) port = rmnet_get_port_rtnl(real_dev); mux_id = rmnet_vnd_get_mux(dev); - rmnet_vnd_dellink(mux_id, por
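After this patch the muxed endpoints live in per-mux-id hlist buckets on the port, and `rmnet_get_endpoint()` walks the (usually single-entry) chain. A userspace sketch of that table using plain singly linked nodes in place of the kernel's hlist; the types and helper names here are simplified stand-ins:

```c
#include <assert.h>
#include <stddef.h>

#define RMNET_MAX_LOGICAL_EP 255

struct endpoint {
	struct endpoint *next;
	unsigned char mux_id;
};

struct port {
	/* One bucket head per possible mux id, like port->muxed_ep[]. */
	struct endpoint *muxed_ep[RMNET_MAX_LOGICAL_EP];
};

static void ep_add(struct port *p, struct endpoint *ep)
{
	ep->next = p->muxed_ep[ep->mux_id];
	p->muxed_ep[ep->mux_id] = ep;
}

static struct endpoint *ep_get(struct port *p, unsigned char mux_id)
{
	struct endpoint *ep;

	for (ep = p->muxed_ep[mux_id]; ep; ep = ep->next)
		if (ep->mux_id == mux_id)
			return ep;
	return NULL;
}

/* Add one endpoint, look it up by mux id, and miss on another id. */
static int demo_endpoint_lookup(void)
{
	static struct port p;                 /* zero-initialized buckets */
	static struct endpoint e = { NULL, 7 };

	ep_add(&p, &e);
	return ep_get(&p, 7) == &e && ep_get(&p, 8) == NULL;
}
```

Since usually only a handful of mux ids are active, each bucket stays tiny and lookup is effectively O(1), which is the motivation stated in the commit message.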
[PATCH net 1/7] net: qualcomm: rmnet: Remove existing logic for bridge mode
This will be rewritten in the following patches. Fixes: ceed73a2cf4a ("drivers: net: ethernet: qualcomm: rmnet: Initial implementation") Signed-off-by: Subash Abhinov Kasiviswanathan --- drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h | 1 - .../net/ethernet/qualcomm/rmnet/rmnet_handlers.c | 77 +++--- 2 files changed, 9 insertions(+), 69 deletions(-) diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h index dde4e9f..0b0c5a7 100644 --- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h +++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h @@ -34,7 +34,6 @@ struct rmnet_endpoint { */ struct rmnet_port { struct net_device *dev; - struct rmnet_endpoint local_ep; struct rmnet_endpoint muxed_ep[RMNET_MAX_LOGICAL_EP]; u32 ingress_data_format; u32 egress_data_format; diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c b/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c index 540c762..b50f401 100644 --- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c +++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c @@ -44,56 +44,18 @@ static void rmnet_set_skb_proto(struct sk_buff *skb) /* Generic handler */ static rx_handler_result_t -rmnet_bridge_handler(struct sk_buff *skb, struct rmnet_endpoint *ep) +rmnet_deliver_skb(struct sk_buff *skb) { - if (!ep->egress_dev) - kfree_skb(skb); - else - rmnet_egress_handler(skb, ep); + skb_reset_transport_header(skb); + skb_reset_network_header(skb); + rmnet_vnd_rx_fixup(skb, skb->dev); + skb->pkt_type = PACKET_HOST; + skb_set_mac_header(skb, 0); + netif_receive_skb(skb); return RX_HANDLER_CONSUMED; } -static rx_handler_result_t -rmnet_deliver_skb(struct sk_buff *skb, struct rmnet_endpoint *ep) -{ - switch (ep->rmnet_mode) { - case RMNET_EPMODE_NONE: - return RX_HANDLER_PASS; - - case RMNET_EPMODE_BRIDGE: - return rmnet_bridge_handler(skb, ep); - - case RMNET_EPMODE_VND: - skb_reset_transport_header(skb); - skb_reset_network_header(skb); - 
rmnet_vnd_rx_fixup(skb, skb->dev); - - skb->pkt_type = PACKET_HOST; - skb_set_mac_header(skb, 0); - netif_receive_skb(skb); - return RX_HANDLER_CONSUMED; - - default: - kfree_skb(skb); - return RX_HANDLER_CONSUMED; - } -} - -static rx_handler_result_t -rmnet_ingress_deliver_packet(struct sk_buff *skb, -struct rmnet_port *port) -{ - if (!port) { - kfree_skb(skb); - return RX_HANDLER_CONSUMED; - } - - skb->dev = port->local_ep.egress_dev; - - return rmnet_deliver_skb(skb, &port->local_ep); -} - /* MAP handler */ static rx_handler_result_t @@ -130,7 +92,7 @@ static void rmnet_set_skb_proto(struct sk_buff *skb) skb_pull(skb, sizeof(struct rmnet_map_header)); skb_trim(skb, len); rmnet_set_skb_proto(skb); - return rmnet_deliver_skb(skb, ep); + return rmnet_deliver_skb(skb); } static rx_handler_result_t @@ -204,29 +166,8 @@ rx_handler_result_t rmnet_rx_handler(struct sk_buff **pskb) dev = skb->dev; port = rmnet_get_port(dev); - if (port->ingress_data_format & RMNET_INGRESS_FORMAT_MAP) { + if (port->ingress_data_format & RMNET_INGRESS_FORMAT_MAP) rc = rmnet_map_ingress_handler(skb, port); - } else { - switch (ntohs(skb->protocol)) { - case ETH_P_MAP: - if (port->local_ep.rmnet_mode == - RMNET_EPMODE_BRIDGE) { - rc = rmnet_ingress_deliver_packet(skb, port); - } else { - kfree_skb(skb); - rc = RX_HANDLER_CONSUMED; - } - break; - - case ETH_P_IP: - case ETH_P_IPV6: - rc = rmnet_ingress_deliver_packet(skb, port); - break; - - default: - rc = RX_HANDLER_PASS; - } - } return rc; } -- 1.9.1
[PATCH net 5/7] net: qualcomm: rmnet: Remove duplicate setting of rmnet_devices
The rmnet_devices information is already stored in muxed_ep, so storing this in rmnet_devices[] again is redundant. Fixes: ceed73a2cf4a ("drivers: net: ethernet: qualcomm: rmnet: Initial implementation") Signed-off-by: Subash Abhinov Kasiviswanathan --- drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h | 1 - drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c| 8 2 files changed, 4 insertions(+), 5 deletions(-) diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h index c5f5c6d..123ccf4 100644 --- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h +++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h @@ -33,7 +33,6 @@ struct rmnet_port { struct rmnet_endpoint muxed_ep[RMNET_MAX_LOGICAL_EP]; u32 ingress_data_format; u32 egress_data_format; - struct net_device *rmnet_devices[RMNET_MAX_LOGICAL_EP]; u8 nr_rmnet_devs; u8 rmnet_mode; }; diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c b/drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c index 4ca59a4..8b8497b 100644 --- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c +++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c @@ -105,12 +105,12 @@ int rmnet_vnd_newlink(u8 id, struct net_device *rmnet_dev, struct rmnet_priv *priv; int rc; - if (port->rmnet_devices[id]) + if (port->muxed_ep[id].egress_dev) return -EINVAL; rc = register_netdevice(rmnet_dev); if (!rc) { - port->rmnet_devices[id] = rmnet_dev; + port->muxed_ep[id].egress_dev = rmnet_dev; port->nr_rmnet_devs++; rmnet_dev->rtnl_link_ops = &rmnet_link_ops; @@ -127,10 +127,10 @@ int rmnet_vnd_newlink(u8 id, struct net_device *rmnet_dev, int rmnet_vnd_dellink(u8 id, struct rmnet_port *port) { - if (id >= RMNET_MAX_LOGICAL_EP || !port->rmnet_devices[id]) + if (id >= RMNET_MAX_LOGICAL_EP || !port->muxed_ep[id].egress_dev) return -EINVAL; - port->rmnet_devices[id] = NULL; + port->muxed_ep[id].egress_dev = NULL; port->nr_rmnet_devs--; return 0; } -- 1.9.1
[PATCH net 7/7] net: qualcomm: rmnet: Implement bridge mode
Add support to bridge two devices which can send multiplexing and aggregation (MAP) data. This is done only when the data itself is not going to be consumed in the stack but is being passed on to a different endpoint. This is mainly used for testing. Fixes: ceed73a2cf4a ("drivers: net: ethernet: qualcomm: rmnet: Initial implementation") Signed-off-by: Subash Abhinov Kasiviswanathan --- drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c | 91 +- drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h | 6 +- .../net/ethernet/qualcomm/rmnet/rmnet_handlers.c | 18 + drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c| 2 + 4 files changed, 115 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c index 1c93b65..682ab7c 100644 --- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c +++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c @@ -109,6 +109,36 @@ static int rmnet_register_real_device(struct net_device *real_dev) return 0; } +static void rmnet_unregister_bridge(struct net_device *dev, + struct rmnet_port *port) +{ + struct net_device *rmnet_dev, *bridge_dev; + struct rmnet_port *bridge_port; + + if (port->rmnet_mode != RMNET_EPMODE_BRIDGE) + return; + + /* bridge slave handling */ + if (!port->nr_rmnet_devs) { + rmnet_dev = netdev_master_upper_dev_get_rcu(dev); + netdev_upper_dev_unlink(dev, rmnet_dev); + + bridge_dev = port->bridge_ep; + + bridge_port = rmnet_get_port_rtnl(bridge_dev); + bridge_port->bridge_ep = NULL; + bridge_port->rmnet_mode = RMNET_EPMODE_VND; + } else { + bridge_dev = port->bridge_ep; + + bridge_port = rmnet_get_port_rtnl(bridge_dev); + rmnet_dev = netdev_master_upper_dev_get_rcu(bridge_dev); + netdev_upper_dev_unlink(bridge_dev, rmnet_dev); + + rmnet_unregister_real_device(bridge_dev, bridge_port); + } +} + static int rmnet_newlink(struct net *src_net, struct net_device *dev, struct nlattr *tb[], struct nlattr *data[], struct netlink_ext_ack *extack) @@ 
-190,10 +220,10 @@ static void rmnet_dellink(struct net_device *dev, struct list_head *head) ep = rmnet_get_endpoint(port, mux_id); if (ep) { hlist_del_init_rcu(&ep->hlnode); + rmnet_unregister_bridge(dev, port); rmnet_vnd_dellink(mux_id, port, ep); kfree(ep); } - rmnet_unregister_real_device(real_dev, port); unregister_netdevice_queue(dev, head); @@ -237,6 +267,8 @@ static void rmnet_force_unassociate_device(struct net_device *dev) d.port = port; rcu_read_lock(); + rmnet_unregister_bridge(dev, port); + netdev_walk_all_lower_dev_rcu(real_dev, rmnet_dev_walk_unreg, &d); rcu_read_unlock(); unregister_netdevice_many(&list); @@ -321,6 +353,63 @@ struct rmnet_endpoint *rmnet_get_endpoint(struct rmnet_port *port, u8 mux_id) return NULL; } +int rmnet_add_bridge(struct net_device *rmnet_dev, +struct net_device *slave_dev) +{ + struct rmnet_priv *priv = netdev_priv(rmnet_dev); + struct net_device *real_dev = priv->real_dev; + struct rmnet_port *port, *slave_port; + int err; + + port = rmnet_get_port(real_dev); + + /* If there is more than one rmnet dev attached, it's probably being +* used for muxing. 
Skip the bridging in that case +*/ + if (port->nr_rmnet_devs > 1) + return -EINVAL; + + if (rmnet_is_real_dev_registered(slave_dev)) + return -EBUSY; + + err = rmnet_register_real_device(slave_dev); + if (err) + return -EBUSY; + + err = netdev_master_upper_dev_link(slave_dev, rmnet_dev, NULL, NULL); + if (err) + return -EINVAL; + + slave_port = rmnet_get_port(slave_dev); + slave_port->rmnet_mode = RMNET_EPMODE_BRIDGE; + slave_port->bridge_ep = real_dev; + + port->rmnet_mode = RMNET_EPMODE_BRIDGE; + port->bridge_ep = slave_dev; + + netdev_dbg(slave_dev, "registered with rmnet as slave\n"); + return 0; +} + +int rmnet_del_bridge(struct net_device *rmnet_dev, +struct net_device *slave_dev) +{ + struct rmnet_priv *priv = netdev_priv(rmnet_dev); + struct net_device *real_dev = priv->real_dev; + struct rmnet_port *port, *slave_port; + + port = rmnet_get_port(real_dev); + port->rmnet_mode = RMNET_EPMODE_VND; + port->bridge_ep = NULL; + + netdev_upper_dev_unlink(slave_dev, rmnet_dev); + slave_port = rmnet_get_port(slave_dev); + rmnet_unregister_real_device(slave_dev, slave_port); + +
[PATCH net 2/7] net: qualcomm: rmnet: Remove some unused defines
Most of these constants were used in the initial patchset where custom netlink configuration was used and hence are no longer relevant. Fixes: ceed73a2cf4a ("drivers: net: ethernet: qualcomm: rmnet: Initial implementation") Signed-off-by: Subash Abhinov Kasiviswanathan --- drivers/net/ethernet/qualcomm/rmnet/rmnet_private.h | 8 1 file changed, 8 deletions(-) diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_private.h b/drivers/net/ethernet/qualcomm/rmnet/rmnet_private.h index 7967198..49102f9 100644 --- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_private.h +++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_private.h @@ -19,23 +19,15 @@ #define RMNET_TX_QUEUE_LEN 1000 /* Constants */ -#define RMNET_EGRESS_FORMAT__RESERVED__ BIT(0) #define RMNET_EGRESS_FORMAT_MAP BIT(1) #define RMNET_EGRESS_FORMAT_AGGREGATION BIT(2) #define RMNET_EGRESS_FORMAT_MUXING BIT(3) -#define RMNET_EGRESS_FORMAT_MAP_CKSUMV3 BIT(4) -#define RMNET_EGRESS_FORMAT_MAP_CKSUMV4 BIT(5) -#define RMNET_INGRESS_FIX_ETHERNET BIT(0) #define RMNET_INGRESS_FORMAT_MAPBIT(1) #define RMNET_INGRESS_FORMAT_DEAGGREGATION BIT(2) #define RMNET_INGRESS_FORMAT_DEMUXING BIT(3) #define RMNET_INGRESS_FORMAT_MAP_COMMANDS BIT(4) -#define RMNET_INGRESS_FORMAT_MAP_CKSUMV3BIT(5) -#define RMNET_INGRESS_FORMAT_MAP_CKSUMV4BIT(6) -/* Pass the frame up the stack with no modifications to skb->dev */ -#define RMNET_EPMODE_NONE (0) /* Replace skb->dev to a virtual rmnet device and pass up the stack */ #define RMNET_EPMODE_VND (1) /* Pass the frame directly to another device with dev_queue_xmit() */ -- 1.9.1
[PATCH net 0/7] net: qualcomm: rmnet: Fix some existing functionality
This series fixes some of the broken rmnet functionality from the initial patchset. Bridge mode is re-written and made useable and the muxed_ep is converted to hlist. Patches 1-5 are cleanups in preparation for these changes. Patch 6 does the hlist conversion. Patch 7 has the implementation of the rmnet bridge mode. Note that there will be a compilation error when merging net with net-next due to the addition of the ext ack argument in netdev_master_upper_dev_link / ndo_add_slave. Subash Abhinov Kasiviswanathan (7): net: qualcomm: rmnet: Remove existing logic for bridge mode net: qualcomm: rmnet: Remove some unused defines net: qualcomm: rmnet: Move rmnet_mode to rmnet_port net: qualcomm: rmnet: Remove duplicate setting of rmnet private info net: qualcomm: rmnet: Remove duplicate setting of rmnet_devices net: qualcomm: rmnet: Convert the muxed endpoint to hlist net: qualcomm: rmnet: Implement bridge mode drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c | 164 - drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h | 18 +-- .../net/ethernet/qualcomm/rmnet/rmnet_handlers.c | 131 ++-- .../net/ethernet/qualcomm/rmnet/rmnet_handlers.h | 3 +- .../ethernet/qualcomm/rmnet/rmnet_map_command.c| 4 +- .../net/ethernet/qualcomm/rmnet/rmnet_private.h| 8 - drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c| 36 ++--- drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.h| 7 +- 8 files changed, 201 insertions(+), 170 deletions(-) -- 1.9.1
[PATCH net-next v3 5/5] selinux: bpf: Add additional check for bpf object file receive
From: Chenbo Feng Introduce a bpf object related check when sending and receiving files through unix domain socket as well as binder. It checks if the receiving process has privilege to read/write the bpf map or use the bpf program. This check is necessary because the bpf maps and programs are using an anonymous inode as their shared inode, so the normal way of checking the files and sockets when passing between processes cannot work properly on eBPF objects. This check only works when BPF_SYSCALL is configured. The information stored inside the file security struct is the same as the information in the bpf object security struct. Signed-off-by: Chenbo Feng --- include/linux/lsm_hooks.h | 17 ++ include/linux/security.h | 9 ++ kernel/bpf/syscall.c | 27 ++-- security/security.c | 8 + security/selinux/hooks.c | 67 +++ security/selinux/include/objsec.h | 9 ++ 6 files changed, 135 insertions(+), 2 deletions(-) diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h index 7161d8e7ee79..517dea60b87b 100644 --- a/include/linux/lsm_hooks.h +++ b/include/linux/lsm_hooks.h @@ -1385,6 +1385,19 @@ * @bpf_prog_free_security: * Clean up the security information stored inside bpf prog. * + * @bpf_map_file: + * When creating a bpf map fd, set up the file security information with + * the bpf security information stored in the map struct. So when the map + * fd is passed between processes, the security module can directly read + * the security information from file security struct rather than the bpf + * security struct. + * + * @bpf_prog_file: + * When creating a bpf prog fd, set up the file security information with + * the bpf security information stored in the prog struct. So when the prog + * fd is passed between processes, the security module can directly read + * the security information from file security struct rather than the bpf + * security struct. 
*/ union security_list_options { int (*binder_set_context_mgr)(struct task_struct *mgr); @@ -1726,6 +1739,8 @@ union security_list_options { void (*bpf_map_free_security)(struct bpf_map *map); int (*bpf_prog_alloc_security)(struct bpf_prog_aux *aux); void (*bpf_prog_free_security)(struct bpf_prog_aux *aux); + void (*bpf_map_file)(struct bpf_map *map, struct file *file); + void (*bpf_prog_file)(struct bpf_prog_aux *aux, struct file *file); #endif /* CONFIG_BPF_SYSCALL */ }; @@ -1954,6 +1969,8 @@ struct security_hook_heads { struct list_head bpf_map_free_security; struct list_head bpf_prog_alloc_security; struct list_head bpf_prog_free_security; + struct list_head bpf_map_file; + struct list_head bpf_prog_file; #endif /* CONFIG_BPF_SYSCALL */ } __randomize_layout; diff --git a/include/linux/security.h b/include/linux/security.h index 18800b0911e5..57573b794e2d 100644 --- a/include/linux/security.h +++ b/include/linux/security.h @@ -1740,6 +1740,8 @@ extern int security_bpf_map_alloc(struct bpf_map *map); extern void security_bpf_map_free(struct bpf_map *map); extern int security_bpf_prog_alloc(struct bpf_prog_aux *aux); extern void security_bpf_prog_free(struct bpf_prog_aux *aux); +extern void security_bpf_map_file(struct bpf_map *map, struct file *file); +extern void security_bpf_prog_file(struct bpf_prog_aux *aux, struct file *file); #else static inline int security_bpf(int cmd, union bpf_attr *attr, unsigned int size) @@ -1772,6 +1774,13 @@ static inline int security_bpf_prog_alloc(struct bpf_prog_aux *aux) static inline void security_bpf_prog_free(struct bpf_prog_aux *aux) { } + +static inline void security_bpf_map_file(struct bpf_map *map, struct file *file) +{ } + +static inline void security_bpf_prog_file(struct bpf_prog_aux *aux, + struct file *file) +{ } #endif /* CONFIG_SECURITY */ #endif /* CONFIG_BPF_SYSCALL */ diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 1cf31ddd7616..aee69e564c50 100644 --- a/kernel/bpf/syscall.c +++ 
b/kernel/bpf/syscall.c @@ -324,11 +324,22 @@ static const struct file_operations bpf_map_fops = { int bpf_map_new_fd(struct bpf_map *map, int flags) { + int fd; + struct fd f; if (security_bpf_map(map, OPEN_FMODE(flags))) return -EPERM; - return anon_inode_getfd("bpf-map", &bpf_map_fops, map, + fd = anon_inode_getfd("bpf-map", &bpf_map_fops, map, flags | O_CLOEXEC); + if (fd < 0) + return fd; + + f = fdget(fd); + if (!f.file) + return -EBADF; + security_bpf_map_file(map, f.file); + fdput(f); + return fd; } int bpf_get_file_flag(int flags) @@ -975,11 +986,23 @@ static const struct file_operations
[PATCH net-next v3 2/5] bpf: Add tests for eBPF file mode
From: Chenbo Feng Two tests are added to the bpf selftests to cover read only maps and write only maps. The tests verify that the read only and write only flags are working on hash maps. Signed-off-by: Chenbo Feng --- tools/testing/selftests/bpf/test_maps.c | 48 + 1 file changed, 48 insertions(+) diff --git a/tools/testing/selftests/bpf/test_maps.c b/tools/testing/selftests/bpf/test_maps.c index fe3a443a1102..896f23cfe918 100644 --- a/tools/testing/selftests/bpf/test_maps.c +++ b/tools/testing/selftests/bpf/test_maps.c @@ -1033,6 +1033,51 @@ static void test_map_parallel(void) assert(bpf_map_get_next_key(fd, &key, &key) == -1 && errno == ENOENT); } +static void test_map_rdonly(void) +{ + int i, fd, key = 0, value = 0; + + fd = bpf_create_map(BPF_MAP_TYPE_HASH, sizeof(key), sizeof(value), + MAP_SIZE, map_flags | BPF_F_RDONLY); + if (fd < 0) { + printf("Failed to create map for read only test '%s'!\n", + strerror(errno)); + exit(1); + } + + key = 1; + value = 1234; + /* Try to insert key=1; writes must fail with EPERM. */ + assert(bpf_map_update_elem(fd, &key, &value, BPF_ANY) == -1 && + errno == EPERM); + + /* Check that key=1 was never inserted. */ + assert(bpf_map_lookup_elem(fd, &key, &value) == -1 && errno == ENOENT); + assert(bpf_map_get_next_key(fd, &key, &value) == -1 && errno == ENOENT); +} + +static void test_map_wronly(void) +{ + int i, fd, key = 0, value = 0; + + fd = bpf_create_map(BPF_MAP_TYPE_HASH, sizeof(key), sizeof(value), + MAP_SIZE, map_flags | BPF_F_WRONLY); + if (fd < 0) { + printf("Failed to create map for write only test '%s'!\n", + strerror(errno)); + exit(1); + } + + key = 1; + value = 1234; + /* Insert key=1 element. */ + assert(bpf_map_update_elem(fd, &key, &value, BPF_ANY) == 0); + + /* Check that reading the map is not allowed. 
*/ + assert(bpf_map_lookup_elem(fd, &key, &value) == -1 && errno == EPERM); + assert(bpf_map_get_next_key(fd, &key, &value) == -1 && errno == EPERM); +} + static void run_all_tests(void) { test_hashmap(0, NULL); @@ -1050,6 +1095,9 @@ static void run_all_tests(void) test_map_large(); test_map_parallel(); test_map_stress(); + + test_map_rdonly(); + test_map_wronly(); } int main(void) -- 2.14.2.920.gcf0c67979c-goog
[PATCH net-next v3 1/5] bpf: Add file mode configuration into bpf maps
From: Chenbo Feng Introduce the map read/write flags to the eBPF syscalls that return the map fd. The flags are used to set up the file mode when constructing a new file descriptor for bpf maps. To not break backward compatibility, the f_flags is set to O_RDWR if the flag passed by the syscall is 0. Otherwise it should be O_RDONLY or O_WRONLY. When userspace wants to modify or read the map content, the kernel will check the file mode to see if the operation is allowed. Signed-off-by: Chenbo Feng Acked-by: Alexei Starovoitov --- include/linux/bpf.h | 6 ++-- include/uapi/linux/bpf.h | 6 kernel/bpf/arraymap.c| 6 +++- kernel/bpf/devmap.c | 5 ++- kernel/bpf/hashtab.c | 5 +-- kernel/bpf/inode.c | 15 ++--- kernel/bpf/lpm_trie.c| 3 +- kernel/bpf/sockmap.c | 5 ++- kernel/bpf/stackmap.c| 5 ++- kernel/bpf/syscall.c | 80 +++- 10 files changed, 114 insertions(+), 22 deletions(-) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index bc7da2ddfcaf..0e9ca2555d7f 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -308,11 +308,11 @@ void bpf_map_area_free(void *base); extern int sysctl_unprivileged_bpf_disabled; -int bpf_map_new_fd(struct bpf_map *map); +int bpf_map_new_fd(struct bpf_map *map, int flags); int bpf_prog_new_fd(struct bpf_prog *prog); int bpf_obj_pin_user(u32 ufd, const char __user *pathname); -int bpf_obj_get_user(const char __user *pathname); +int bpf_obj_get_user(const char __user *pathname, int flags); int bpf_percpu_hash_copy(struct bpf_map *map, void *key, void *value); int bpf_percpu_array_copy(struct bpf_map *map, void *key, void *value); @@ -331,6 +331,8 @@ int bpf_fd_htab_map_update_elem(struct bpf_map *map, struct file *map_file, void *key, void *value, u64 map_flags); int bpf_fd_htab_map_lookup_elem(struct bpf_map *map, void *key, u32 *value); +int bpf_get_file_flag(int flags); + /* memcpy that is used with 8-byte aligned pointers, power-of-8 size and * forced to use 'long' read/writes to try to atomically copy long counters. 
* Best-effort only. No barriers here, since it _will_ race with concurrent diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 6db9e1d679cd..9cb50a228c39 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -217,6 +217,10 @@ enum bpf_attach_type { #define BPF_OBJ_NAME_LEN 16U +/* Flags for accessing BPF object */ +#define BPF_F_RDONLY (1U << 3) +#define BPF_F_WRONLY (1U << 4) + union bpf_attr { struct { /* anonymous struct used by BPF_MAP_CREATE command */ __u32 map_type; /* one of enum bpf_map_type */ @@ -259,6 +263,7 @@ union bpf_attr { struct { /* anonymous struct used by BPF_OBJ_* commands */ __aligned_u64 pathname; __u32 bpf_fd; + __u32 file_flags; }; struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */ @@ -286,6 +291,7 @@ union bpf_attr { __u32 map_id; }; __u32 next_id; + __u32 open_flags; }; struct { /* anonymous struct used by BPF_OBJ_GET_INFO_BY_FD */ diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c index 68d866628be0..988c04c91e10 100644 --- a/kernel/bpf/arraymap.c +++ b/kernel/bpf/arraymap.c @@ -19,6 +19,9 @@ #include "map_in_map.h" +#define ARRAY_CREATE_FLAG_MASK \ + (BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY) + static void bpf_array_free_percpu(struct bpf_array *array) { int i; @@ -56,7 +59,8 @@ static struct bpf_map *array_map_alloc(union bpf_attr *attr) /* check sanity of attributes */ if (attr->max_entries == 0 || attr->key_size != 4 || - attr->value_size == 0 || attr->map_flags & ~BPF_F_NUMA_NODE || + attr->value_size == 0 || + attr->map_flags & ~ARRAY_CREATE_FLAG_MASK || (percpu && numa_node != NUMA_NO_NODE)) return ERR_PTR(-EINVAL); diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c index e093d9a2c4dd..e5d3de7cff2e 100644 --- a/kernel/bpf/devmap.c +++ b/kernel/bpf/devmap.c @@ -50,6 +50,9 @@ #include #include +#define DEV_CREATE_FLAG_MASK \ + (BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY) + struct bpf_dtab_netdev { struct net_device *dev; struct bpf_dtab *dtab; @@ -80,7 
+83,7 @@ static struct bpf_map *dev_map_alloc(union bpf_attr *attr) /* check sanity of attributes */ if (attr->max_entries == 0 || attr->key_size != 4 || - attr->value_size != 4 || attr->map_flags & ~BPF_F_NUMA_NODE) + attr->value_size != 4 || attr->map_flags & ~DEV_CREATE_FLAG_MASK) return ERR_PTR(-EINVAL); dtab = kzalloc(sizeof(*dtab), GFP_USER); diff --git a/kernel/bpf/hashtab.c
[PATCH net-next v3 3/5] security: bpf: Add LSM hooks for bpf object related syscall
From: Chenbo Feng Introduce several LSM hooks for the syscalls that will allow userspace to access eBPF objects such as eBPF programs and eBPF maps. The security check is aimed at enforcing per-object security protection for eBPF objects, so that only processes with the right privileges can read/write a specific map or use a specific eBPF program. Besides that, a general security hook is added before the multiplexer of the bpf syscall to check the cmd and the attributes used for the command. The actual security module can decide which commands need to be checked and how. Signed-off-by: Chenbo Feng --- include/linux/bpf.h | 6 ++ include/linux/lsm_hooks.h | 54 +++ include/linux/security.h | 45 +++ kernel/bpf/syscall.c | 28 ++-- security/security.c | 32 5 files changed, 163 insertions(+), 2 deletions(-) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 0e9ca2555d7f..225740688ab7 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -57,6 +57,9 @@ struct bpf_map { atomic_t usercnt; struct bpf_map *inner_map_meta; char name[BPF_OBJ_NAME_LEN]; +#ifdef CONFIG_SECURITY + void *security; +#endif }; /* function argument constraints */ @@ -190,6 +193,9 @@ struct bpf_prog_aux { struct user_struct *user; u64 load_time; /* ns since boottime */ char name[BPF_OBJ_NAME_LEN]; +#ifdef CONFIG_SECURITY + void *security; +#endif union { struct work_struct work; struct rcu_head rcu; diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h index c9258124e417..7161d8e7ee79 100644 --- a/include/linux/lsm_hooks.h +++ b/include/linux/lsm_hooks.h @@ -1351,6 +1351,40 @@ * @inode we wish to get the security context of. * @ctx is a pointer in which to place the allocated security context. * @ctxlen points to the place to put the length of @ctx. + * + * Security hooks for using the eBPF maps and programs functionalities through + * eBPF syscalls. 
+ * + * @bpf: + * Do an initial check for all bpf syscalls after the attribute is copied + * into the kernel. The actual security module can implement its own + * rules to check the specific cmd they need. + * + * @bpf_map: + * Do a check when the kernel generates and returns a file descriptor for + * eBPF maps. + * + * @map: bpf map that we want to access + * @mask: the access flags + * + * @bpf_prog: + * Do a check when the kernel generates and returns a file descriptor for + * eBPF programs. + * + * @prog: bpf prog that userspace wants to use. + * + * @bpf_map_alloc_security: + * Initialize the security field inside bpf map. + * + * @bpf_map_free_security: + * Clean up the security information stored inside bpf map. + * + * @bpf_prog_alloc_security: + * Initialize the security field inside bpf program. + * + * @bpf_prog_free_security: + * Clean up the security information stored inside bpf prog. + * */ union security_list_options { int (*binder_set_context_mgr)(struct task_struct *mgr); @@ -1682,6 +1716,17 @@ union security_list_options { struct audit_context *actx); void (*audit_rule_free)(void *lsmrule); #endif /* CONFIG_AUDIT */ + +#ifdef CONFIG_BPF_SYSCALL + int (*bpf)(int cmd, union bpf_attr *attr, +unsigned int size); + int (*bpf_map)(struct bpf_map *map, fmode_t fmode); + int (*bpf_prog)(struct bpf_prog *prog); + int (*bpf_map_alloc_security)(struct bpf_map *map); + void (*bpf_map_free_security)(struct bpf_map *map); + int (*bpf_prog_alloc_security)(struct bpf_prog_aux *aux); + void (*bpf_prog_free_security)(struct bpf_prog_aux *aux); +#endif /* CONFIG_BPF_SYSCALL */ }; struct security_hook_heads { @@ -1901,6 +1946,15 @@ struct security_hook_heads { struct list_head audit_rule_match; struct list_head audit_rule_free; #endif /* CONFIG_AUDIT */ +#ifdef CONFIG_BPF_SYSCALL + struct list_head bpf; + struct list_head bpf_map; + struct list_head bpf_prog; + struct list_head bpf_map_alloc_security; + struct list_head bpf_map_free_security; + struct list_head 
bpf_prog_alloc_security; + struct list_head bpf_prog_free_security; +#endif /* CONFIG_BPF_SYSCALL */ } __randomize_layout; /* diff --git a/include/linux/security.h b/include/linux/security.h index ce6265960d6c..18800b0911e5 100644 --- a/include/linux/security.h +++ b/include/linux/security.h @@ -31,6 +31,7 @@ #include #include #include +#include struct linux_binprm; struct cred; @@ -1730,6 +1731,50 @@ static inline void securityfs_remove(struct dentry *dentry) #endif +#ifdef CONFIG_BPF_SYSCALL +#ifdef CONFIG_SECURITY +extern int security_bpf(int cmd, uni
[PATCH net-next v3 4/5] selinux: bpf: Add selinux check for eBPF syscall operations
From: Chenbo Feng Implement the actual checks introduced to eBPF related syscalls. This implementation uses the security field inside the bpf object to store a sid that identifies the bpf object. When processes try to access the object, SELinux checks whether they have the right privileges. The creation of eBPF objects is also checked at the general bpf check hook, and new commands introduced to the eBPF domain can also be checked there. Signed-off-by: Chenbo Feng Acked-by: Alexei Starovoitov --- security/selinux/hooks.c| 111 security/selinux/include/classmap.h | 2 + security/selinux/include/objsec.h | 4 ++ 3 files changed, 117 insertions(+) diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c index f5d304736852..94e473b9c884 100644 --- a/security/selinux/hooks.c +++ b/security/selinux/hooks.c @@ -85,6 +85,7 @@ #include #include #include +#include #include "avc.h" #include "objsec.h" @@ -6252,6 +6253,106 @@ static void selinux_ib_free_security(void *ib_sec) } #endif +#ifdef CONFIG_BPF_SYSCALL +static int selinux_bpf(int cmd, union bpf_attr *attr, +unsigned int size) +{ + u32 sid = current_sid(); + int ret; + + switch (cmd) { + case BPF_MAP_CREATE: + ret = avc_has_perm(sid, sid, SECCLASS_BPF, BPF__MAP_CREATE, + NULL); + break; + case BPF_PROG_LOAD: + ret = avc_has_perm(sid, sid, SECCLASS_BPF, BPF__PROG_LOAD, + NULL); + break; + default: + ret = 0; + break; + } + + return ret; +} + +static u32 bpf_map_fmode_to_av(fmode_t fmode) +{ + u32 av = 0; + + if (fmode & FMODE_READ) + av |= BPF__MAP_READ; + if (fmode & FMODE_WRITE) + av |= BPF__MAP_WRITE; + return av; +} + +static int selinux_bpf_map(struct bpf_map *map, fmode_t fmode) +{ + u32 sid = current_sid(); + struct bpf_security_struct *bpfsec; + + bpfsec = map->security; + return avc_has_perm(sid, bpfsec->sid, SECCLASS_BPF, + bpf_map_fmode_to_av(fmode), NULL); +} + +static int selinux_bpf_prog(struct bpf_prog *prog) +{ + u32 sid = current_sid(); + struct bpf_security_struct *bpfsec; + + bpfsec = prog->aux->security; 
+ return avc_has_perm(sid, bpfsec->sid, SECCLASS_BPF, + BPF__PROG_USE, NULL); +} + +static int selinux_bpf_map_alloc(struct bpf_map *map) +{ + struct bpf_security_struct *bpfsec; + + bpfsec = kzalloc(sizeof(*bpfsec), GFP_KERNEL); + if (!bpfsec) + return -ENOMEM; + + bpfsec->sid = current_sid(); + map->security = bpfsec; + + return 0; +} + +static void selinux_bpf_map_free(struct bpf_map *map) +{ + struct bpf_security_struct *bpfsec = map->security; + + map->security = NULL; + kfree(bpfsec); +} + +static int selinux_bpf_prog_alloc(struct bpf_prog_aux *aux) +{ + struct bpf_security_struct *bpfsec; + + bpfsec = kzalloc(sizeof(*bpfsec), GFP_KERNEL); + if (!bpfsec) + return -ENOMEM; + + bpfsec->sid = current_sid(); + aux->security = bpfsec; + + return 0; +} + +static void selinux_bpf_prog_free(struct bpf_prog_aux *aux) +{ + struct bpf_security_struct *bpfsec = aux->security; + + aux->security = NULL; + kfree(bpfsec); +} +#endif + static struct security_hook_list selinux_hooks[] __lsm_ro_after_init = { LSM_HOOK_INIT(binder_set_context_mgr, selinux_binder_set_context_mgr), LSM_HOOK_INIT(binder_transaction, selinux_binder_transaction), @@ -6471,6 +6572,16 @@ static struct security_hook_list selinux_hooks[] __lsm_ro_after_init = { LSM_HOOK_INIT(audit_rule_match, selinux_audit_rule_match), LSM_HOOK_INIT(audit_rule_free, selinux_audit_rule_free), #endif + +#ifdef CONFIG_BPF_SYSCALL + LSM_HOOK_INIT(bpf, selinux_bpf), + LSM_HOOK_INIT(bpf_map, selinux_bpf_map), + LSM_HOOK_INIT(bpf_prog, selinux_bpf_prog), + LSM_HOOK_INIT(bpf_map_alloc_security, selinux_bpf_map_alloc), + LSM_HOOK_INIT(bpf_prog_alloc_security, selinux_bpf_prog_alloc), + LSM_HOOK_INIT(bpf_map_free_security, selinux_bpf_map_free), + LSM_HOOK_INIT(bpf_prog_free_security, selinux_bpf_prog_free), +#endif }; static __init int selinux_init(void) diff --git a/security/selinux/include/classmap.h b/security/selinux/include/classmap.h index 35ffb29a69cb..a91fa46a789f 100644 --- a/security/selinux/include/classmap.h +++ 
b/security/selinux/include/classmap.h @@ -237,6 +237,8 @@ struct security_class_mapping secclass_map[] = { { "access", NULL } }, { "infiniband_endport", { "manage_subnet", NULL } }, + { "bpf", + {"map_create", "map_read", "map_write", "prog_load", "prog_use"} }, { NULL }
[PATCH net-next v3 0/5] bpf: security: New file mode and LSM hooks for eBPF object permission control
From: Chenbo Feng Much like files and sockets, eBPF objects are accessed, controlled, and shared via a file descriptor (FD). Unlike files and sockets, the existing mechanism for eBPF object access control is very limited. Currently there are two options for granting access to eBPF operations: grant access to all processes, or only CAP_SYS_ADMIN processes. The CAP_SYS_ADMIN-only mode is not ideal because most users do not have this capability and granting a user CAP_SYS_ADMIN grants too many other security-sensitive permissions. It also unnecessarily allows all CAP_SYS_ADMIN processes access to eBPF functionality. Allowing all processes to access eBPF objects is also undesirable since it has the potential to allow unprivileged processes to consume kernel memory, and opens up attack surface to the kernel. Adding LSM hooks maintains the status quo for systems which do not use an LSM, preserving compatibility with userspace, while allowing security modules to choose how best to handle permissions on eBPF objects. Here is a possible use case for the LSM hooks with the SELinux module: The network-control daemon (netd) creates and loads an eBPF object for network packet filtering and analysis. It passes the object FD to an unprivileged network monitor app (netmonitor), which is not allowed to create, modify or load eBPF objects, but is allowed to read the traffic stats from the map. SELinux could use these hooks to grant the following permissions: allow netd self:bpf_map { create read write }; allow netmonitor netd:fd use; allow netmonitor netd:bpf_map read; In this patch series, a file mode is added to the bpf map to store the accessing mode. With these file mode flags, the map can be obtained read only, write only or read and write. With the help of this file mode, several security hooks can be added to the eBPF syscall implementations to do permission checks. 
These LSM hooks are mainly focused on checking the process privileges before it obtains the fd for a specific bpf object, whether from a file location or from an eBPF id. Besides that, a general check hook is also implemented at the start of the bpf syscalls so that each security module can have its own implementation for the rest of the bpf object related functionality. In order to store the ownership and security information about eBPF maps, a security field pointer is added to the struct bpf_map. The last two patches implement the SELinux checks for these hooks, plus an additional check when an eBPF object is passed between processes using unix sockets as well as binder IPC. Changes since V1: - Whitelist the new bpf flags in the map allocate check. - Added bpf selftest for the new flags. - Added two new security hooks for copying the security information from the bpf object security struct to the file security struct - Simplified the checking action when a bpf fd is passed between processes. Changes since V2: - Fixed the line break problem for the map flags check - Fixed the typo in the selinux check of file mode. - Merge bpf_map and bpf_prog into one selinux class - Added bpf_type and bpf_sid into the file security struct to store the security information when generating the fd. - Add the hook to bpf_map_new_fd and bpf_prog_new_fd. 
Chenbo Feng (5): bpf: Add file mode configuration into bpf maps bpf: Add tests for eBPF file mode security: bpf: Add LSM hooks for bpf object related syscall selinux: bpf: Add selinux check for eBPF syscall operations selinux: bpf: Add additional check for bpf object file receive include/linux/bpf.h | 12 ++- include/linux/lsm_hooks.h | 71 + include/linux/security.h| 54 ++ include/uapi/linux/bpf.h| 6 ++ kernel/bpf/arraymap.c | 6 +- kernel/bpf/devmap.c | 5 +- kernel/bpf/hashtab.c| 5 +- kernel/bpf/inode.c | 15 ++- kernel/bpf/lpm_trie.c | 3 +- kernel/bpf/sockmap.c| 5 +- kernel/bpf/stackmap.c | 5 +- kernel/bpf/syscall.c| 108 ++-- security/security.c | 40 security/selinux/hooks.c| 174 security/selinux/include/classmap.h | 2 + security/selinux/include/objsec.h | 13 +++ tools/testing/selftests/bpf/test_maps.c | 48 + 17 files changed, 548 insertions(+), 24 deletions(-) -- 2.14.2.920.gcf0c67979c-goog
Re: [PATCH net-next v2] openvswitch: add ct_clear action
From: Pravin Shelar Date: Tue, 10 Oct 2017 16:34:29 -0700 > On Tue, Oct 10, 2017 at 1:54 PM, Eric Garver wrote: >> This adds a ct_clear action for clearing conntrack state. ct_clear is >> currently implemented in OVS userspace, but is not backed by an action >> in the kernel datapath. This is useful for flows that may modify a >> packet tuple after a ct lookup has already occurred. >> >> Signed-off-by: Eric Garver >> --- >> v2: >> - Use IP_CT_UNTRACKED for nf_ct_set() >> - Only fill key if previously conntracked >> > Looks good. > Acked-by: Pravin B Shelar Applied.
Re: [PATCH net-next v2] openvswitch: add ct_clear action
On Tue, Oct 10, 2017 at 1:54 PM, Eric Garver wrote: > This adds a ct_clear action for clearing conntrack state. ct_clear is > currently implemented in OVS userspace, but is not backed by an action > in the kernel datapath. This is useful for flows that may modify a > packet tuple after a ct lookup has already occurred. > > Signed-off-by: Eric Garver > --- > v2: > - Use IP_CT_UNTRACKED for nf_ct_set() > - Only fill key if previously conntracked > Looks good. Acked-by: Pravin B Shelar
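For context, the kind of pipeline this action enables looks roughly like the following OpenFlow flows (illustrative only; table numbers and addresses are made up): the packet is conntracked in table 0, the tuple is rewritten in table 1, and ct_clear drops the now-stale conntrack state before a second ct lookup:

```
table=0, ip, actions=ct(table=1)
table=1, ip, nw_dst=10.0.0.1,
    actions=set_field:10.0.0.2->ip_dst,ct_clear,ct(table=2)
table=2, ip, ct_state=+trk+new, actions=output:2
```

Without a kernel ct_clear, such flows fall back to the userspace slow path; with this patch the whole chain can stay in the datapath.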
Re: [PATCH net-next] net: dst: move cpu inside ifdef to avoid compilation warning
From: Jakub Kicinski Date: Tue, 10 Oct 2017 15:05:39 -0700 > If CONFIG_DST_CACHE is not selected cpu variable > will be unused and we will see a compilation warning. > Move it under the ifdef. > > Reported-by: kbuild test robot > Fixes: d66f2b91f95b ("bpf: don't rely on the verifier lock for metadata_dst > allocation") > Signed-off-by: Jakub Kicinski Applied.
Re: [net-next V6 PATCH 1/5] bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAP
On 10/10/2017 02:47 PM, Jesper Dangaard Brouer wrote: [...] +static struct bpf_map *cpu_map_alloc(union bpf_attr *attr) +{ + struct bpf_cpu_map *cmap; + int err = -ENOMEM; + u64 cost; + int ret; + + if (!capable(CAP_SYS_ADMIN)) + return ERR_PTR(-EPERM); + + /* check sanity of attributes */ + if (attr->max_entries == 0 || attr->key_size != 4 || + attr->value_size != 4 || attr->map_flags & ~BPF_F_NUMA_NODE) + return ERR_PTR(-EINVAL); + + cmap = kzalloc(sizeof(*cmap), GFP_USER); + if (!cmap) + return ERR_PTR(-ENOMEM); + + /* mandatory map attributes */ + cmap->map.map_type = attr->map_type; + cmap->map.key_size = attr->key_size; + cmap->map.value_size = attr->value_size; + cmap->map.max_entries = attr->max_entries; + cmap->map.map_flags = attr->map_flags; + cmap->map.numa_node = bpf_map_attr_numa_node(attr); + + /* Pre-limit array size based on NR_CPUS, not final CPU check */ + if (cmap->map.max_entries > NR_CPUS) + return ERR_PTR(-E2BIG); We still have a leak here, meaning kfree(cmap) is missing on above error. 
+ + /* make sure page count doesn't overflow */ + cost = (u64) cmap->map.max_entries * sizeof(struct bpf_cpu_map_entry *); + cost += cpu_map_bitmap_size(attr) * num_possible_cpus(); + if (cost >= U32_MAX - PAGE_SIZE) + goto free_cmap; + cmap->map.pages = round_up(cost, PAGE_SIZE) >> PAGE_SHIFT; + + /* Notice returns -EPERM on if map size is larger than memlock limit */ + ret = bpf_map_precharge_memlock(cmap->map.pages); + if (ret) { + err = ret; + goto free_cmap; + } + + /* A per cpu bitfield with a bit per possible CPU in map */ + cmap->flush_needed = __alloc_percpu(cpu_map_bitmap_size(attr), + __alignof__(unsigned long)); + if (!cmap->flush_needed) + goto free_cmap; + + /* Alloc array for possible remote "destination" CPUs */ + cmap->cpu_map = bpf_map_area_alloc(cmap->map.max_entries * + sizeof(struct bpf_cpu_map_entry *), + cmap->map.numa_node); + if (!cmap->cpu_map) + goto free_cmap; + + return &cmap->map; +free_cmap: + free_percpu(cmap->flush_needed); + kfree(cmap); + return ERR_PTR(err); +}
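Daniel's point: by the time the NR_CPUS check runs, cmap has already been allocated, so failing with a plain `return ERR_PTR(-E2BIG)` leaks it -- that exit has to route through the free_cmap label like the later error paths do. The pattern can be modelled in plain userspace C (a sketch with invented names such as fake_cpu_map_alloc; this is not the kernel code itself):

```c
#include <errno.h>
#include <stdlib.h>

#define FAKE_NR_CPUS 8

struct fake_cpu_map {
	unsigned int max_entries;
	void **cpu_map;
};

/* Every failure after the first allocation must route through the
 * cleanup label; only pre-allocation checks may return directly. */
struct fake_cpu_map *fake_cpu_map_alloc(unsigned int max_entries, int *err)
{
	struct fake_cpu_map *cmap;

	*err = -EINVAL;
	if (max_entries == 0)
		return NULL;		/* nothing allocated yet */

	cmap = calloc(1, sizeof(*cmap));
	if (!cmap) {
		*err = -ENOMEM;
		return NULL;
	}
	cmap->max_entries = max_entries;

	/* The reviewed bug: returning directly here leaked cmap. */
	if (cmap->max_entries > FAKE_NR_CPUS) {
		*err = -E2BIG;
		goto free_cmap;
	}

	cmap->cpu_map = calloc(cmap->max_entries, sizeof(void *));
	if (!cmap->cpu_map) {
		*err = -ENOMEM;
		goto free_cmap;
	}

	*err = 0;
	return cmap;

free_cmap:
	free(cmap);
	return NULL;
}
```

The same rule is why the kernel function initializes err to -ENOMEM up front: any later `goto free_cmap` that forgets to set a specific code still reports a sane error.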
[PATCH] hdlc: Convert timers to use timer_setup()
In preparation for unconditionally passing the struct timer_list pointer to all timer callbacks, switch to using the new timer_setup() and from_timer() to pass the timer pointer explicitly. This adds a pointer back to the net_device, and drops needless open-coded resetting of the .function and .data fields. Cc: David S. Miller Cc: Krzysztof Halasa Cc: netdev@vger.kernel.org Signed-off-by: Kees Cook --- This requires commit 686fef928bba ("timer: Prepare to change timer callback argument type") in v4.14-rc3, but should be otherwise stand-alone. --- drivers/net/wan/hdlc_cisco.c | 15 ++- drivers/net/wan/hdlc_fr.c| 13 ++--- 2 files changed, 12 insertions(+), 16 deletions(-) diff --git a/drivers/net/wan/hdlc_cisco.c b/drivers/net/wan/hdlc_cisco.c index a408abc25512..320039d329c7 100644 --- a/drivers/net/wan/hdlc_cisco.c +++ b/drivers/net/wan/hdlc_cisco.c @@ -54,6 +54,7 @@ struct cisco_state { cisco_proto settings; struct timer_list timer; + struct net_device *dev; spinlock_t lock; unsigned long last_poll; int up; @@ -257,11 +258,10 @@ static int cisco_rx(struct sk_buff *skb) -static void cisco_timer(unsigned long arg) +static void cisco_timer(struct timer_list *t) { - struct net_device *dev = (struct net_device *)arg; - hdlc_device *hdlc = dev_to_hdlc(dev); - struct cisco_state *st = state(hdlc); + struct cisco_state *st = from_timer(st, t, timer); + struct net_device *dev = st->dev; spin_lock(&st->lock); if (st->up && @@ -276,8 +276,6 @@ static void cisco_timer(unsigned long arg) spin_unlock(&st->lock); st->timer.expires = jiffies + st->settings.interval * HZ; - st->timer.function = cisco_timer; - st->timer.data = arg; add_timer(&st->timer); } @@ -293,10 +291,9 @@ static void cisco_start(struct net_device *dev) st->up = st->txseq = st->rxseq = 0; spin_unlock_irqrestore(&st->lock, flags); - init_timer(&st->timer); + st->dev = dev; + timer_setup(&st->timer, cisco_timer, 0); st->timer.expires = jiffies + HZ; /* First poll after 1 s */ - st->timer.function = cisco_timer; - 
st->timer.data = (unsigned long)dev; add_timer(&st->timer); } diff --git a/drivers/net/wan/hdlc_fr.c b/drivers/net/wan/hdlc_fr.c index 78596e42a3f3..038236a9c60e 100644 --- a/drivers/net/wan/hdlc_fr.c +++ b/drivers/net/wan/hdlc_fr.c @@ -140,6 +140,7 @@ struct frad_state { int dce_pvc_count; struct timer_list timer; + struct net_device *dev; unsigned long last_poll; int reliable; int dce_changed; @@ -597,9 +598,10 @@ static void fr_set_link_state(int reliable, struct net_device *dev) } -static void fr_timer(unsigned long arg) +static void fr_timer(struct timer_list *t) { - struct net_device *dev = (struct net_device *)arg; + struct frad_state *st = from_timer(st, t, timer); + struct net_device *dev = st->dev; hdlc_device *hdlc = dev_to_hdlc(dev); int i, cnt = 0, reliable; u32 list; @@ -644,8 +646,6 @@ static void fr_timer(unsigned long arg) state(hdlc)->settings.t391 * HZ; } - state(hdlc)->timer.function = fr_timer; - state(hdlc)->timer.data = arg; add_timer(&state(hdlc)->timer); } @@ -1003,11 +1003,10 @@ static void fr_start(struct net_device *dev) state(hdlc)->n391cnt = 0; state(hdlc)->txseq = state(hdlc)->rxseq = 0; - init_timer(&state(hdlc)->timer); + state(hdlc)->dev = dev; + timer_setup(&state(hdlc)->timer, fr_timer, 0); /* First poll after 1 s */ state(hdlc)->timer.expires = jiffies + HZ; - state(hdlc)->timer.function = fr_timer; - state(hdlc)->timer.data = (unsigned long)dev; add_timer(&state(hdlc)->timer); } else fr_set_link_state(1, dev); -- 2.7.4 -- Kees Cook Pixel Security
[PATCH net-next] net: dst: move cpu inside ifdef to avoid compilation warning
If CONFIG_DST_CACHE is not selected cpu variable will be unused and we will see a compilation warning. Move it under the ifdef. Reported-by: kbuild test robot Fixes: d66f2b91f95b ("bpf: don't rely on the verifier lock for metadata_dst allocation") Signed-off-by: Jakub Kicinski --- net/core/dst.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/core/dst.c b/net/core/dst.c index 8b2eafac984d..662a2d4a3d19 100644 --- a/net/core/dst.c +++ b/net/core/dst.c @@ -325,9 +325,9 @@ EXPORT_SYMBOL_GPL(metadata_dst_alloc_percpu); void metadata_dst_free_percpu(struct metadata_dst __percpu *md_dst) { +#ifdef CONFIG_DST_CACHE int cpu; -#ifdef CONFIG_DST_CACHE for_each_possible_cpu(cpu) { struct metadata_dst *one_md_dst = per_cpu_ptr(md_dst, cpu); -- 2.14.1
[PATCH] i40e: only redistribute MSI-X vectors when needed
Whether or not there are vectors_left, we only need to redistribute our vectors if we didn't get as many as we requested. With the current check, the code will try to redistribute even if we did in fact get all the vectors we requested - this can happen when we have more CPUs than we do vectors. This restores an earlier check to be sure we only redistribute if we didn't get the full count we requested. Fixes: 4ce20abc645f (i40e: fix MSI-X vector redistribution if hw limit is reached) Signed-off-by: Shannon Nelson --- drivers/net/ethernet/intel/i40e/i40e_main.c |5 +++-- 1 files changed, 3 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c index bf91958..535e6e7 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_main.c +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c @@ -8176,7 +8176,7 @@ static int i40e_init_msix(struct i40e_pf *pf) pf->num_lan_qps = 1; pf->num_lan_msix = 1; - } else if (!vectors_left) { + } else if (v_actual != v_budget) { /* If we have limited resources, we will start with no vectors * for the special features and then allocate vectors to some * of these features based on the policy and at the end disable @@ -8185,7 +8185,8 @@ static int i40e_init_msix(struct i40e_pf *pf) int vec; dev_info(&pf->pdev->dev, -"MSI-X vector limit reached, attempting to redistribute vectors\n"); +"MSI-X vector limit reached with %d, wanted %d, attempting to redistribute vectors\n", +v_actual, v_budget); /* reserve the misc vector */ vec = v_actual - 1; -- 1.7.1
Re: [Patch net-next] tcp: add a tracepoint for tcp_retransmit_skb()
Alexei Starovoitov writes: > On Mon, Oct 09, 2017 at 10:35:47PM -0700, Cong Wang wrote: [...] >> +trace_tcp_retransmit_skb(sk, skb, segs); > > I'm happy to see new tracepoints being added to tcp stack, but I'm concerned > with practical usability of them. > Like the above tracepoint definition makes it not very useful from bpf point > of view, > since 'sk' pointer is not recored by as part of the tracepoint. > In bpf/tracing world we prefer tracepoints to have raw pointers recorded > in TP_STRUCT__entry() and _not_ printed in TP_printk() > (since pointers are useless for userspace). Ack. Also could the TP_printk also use the socket cookies so they can get associated with netlink dumps and as such also be associated to user space processes? It could help against races while trying to associate the socket with a process. ss already supports dumping those cookies with -e. The corresponding commit would be: commit 33cf7c90fe2f97afb1cadaa0cfb782cb9d1b9ee2 Author: Eric Dumazet Date: Wed Mar 11 18:53:14 2015 -0700 net: add real socket cookies Right now they only get set when needed but as Eric already mentioned in his commit log, this could be refined. [...]
Re: [patch net-next 3/4] net: sched: convert cls_flower->egress_dev users to tc_setup_cb_egdev infra
On Wed, Oct 11, 2017 at 12:16 AM, Jiri Pirko wrote: > Tue, Oct 10, 2017 at 10:08:23PM CEST, gerlitz...@gmail.com wrote: >>On Tue, Oct 10, 2017 at 10:30 AM, Jiri Pirko wrote: >>> The only user of cls_flower->egress_dev is mlx5. >> >>but nfp supports decap action offload too and from the flower code >>stand point, I guess they are both the same, right? how does it work >>there? > > Apparently they don't use cls_flower->egress_dev. John, can you elaborate on that, how do you manage to get away from that practice?
Re: [patch net-next 0/4] net: sched: get rid of cls_flower->egress_dev
On Wed, Oct 11, 2017 at 12:13 AM, Jiri Pirko wrote: > Tue, Oct 10, 2017 at 07:24:21PM CEST, gerlitz...@gmail.com wrote: > Or, as I replied to you earlier, the issue you describe is totally > unrelated to this patchset as you see the issue with the current net-next. Jiri, the point I wanted to make is that if indeed there's a bug in mlx5 or flower, we will have to fix it for 4.14, and then these bits would have to be rebased when net-next is re-planted over net. I put "FWIW" before that, so maybe it doesn't W so much, we'll see. Or.
[PATCH v2 nf-next 1/2] netfilter: x_tables: make xt_replace_table wait until old rules are not used anymore
xt_replace_table relies on table replacement counter retrieval (which uses xt_recseq to synchronize pcpu counters). This is fine, however with large rule set get_counters() can take a very long time -- it needs to synchronize all counters because it has to assume concurrent modifications can occur. Make xt_replace_table synchronize by itself by waiting until all cpus had an even seqcount. This allows a followup patch to copy the counters of the old ruleset without any synchonization after xt_replace_table has completed. Cc: Dan Williams Cc: Eric Dumazet Signed-off-by: Florian Westphal --- v2: fix Erics email address net/netfilter/x_tables.c | 15 --- 1 file changed, 12 insertions(+), 3 deletions(-) diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c index c83a3b5e1c6c..f2d4a365768f 100644 --- a/net/netfilter/x_tables.c +++ b/net/netfilter/x_tables.c @@ -1153,6 +1153,7 @@ xt_replace_table(struct xt_table *table, int *error) { struct xt_table_info *private; + unsigned int cpu; int ret; ret = xt_jumpstack_alloc(newinfo); @@ -1184,12 +1185,20 @@ xt_replace_table(struct xt_table *table, /* * Even though table entries have now been swapped, other CPU's -* may still be using the old entries. This is okay, because -* resynchronization happens because of the locking done -* during the get_counters() routine. +* may still be using the old entries... */ local_bh_enable(); + /* ... so wait for even xt_recseq on all cpus */ + for_each_possible_cpu(cpu) { + seqcount_t *s = &per_cpu(xt_recseq, cpu); + + while (raw_read_seqcount(s) & 1) + cpu_relax(); + + cond_resched(); + } + #ifdef CONFIG_AUDIT if (audit_enabled) { audit_log(current->audit_context, GFP_KERNEL, -- 2.13.6
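The new wait loop can be modelled in userspace with C11 atomics: each reader bumps a per-thread sequence counter to odd while it is "inside" the old table and back to even when it leaves, and the replacer spins until every counter reads even. This is a sketch with invented names -- the kernel uses per-cpu seqcount_t, not per-thread atomics:

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <unistd.h>

#define NREADERS 4

static atomic_uint seq[NREADERS];

/* A reader bumps its counter to odd on entry and back to even on exit,
 * analogous to xt_write_recseq_begin()/xt_write_recseq_end(). */
static void *reader(void *arg)
{
	atomic_uint *s = arg;

	atomic_fetch_add(s, 1);		/* odd: traversing the old ruleset */
	usleep(1000);			/* pretend to evaluate rules */
	atomic_fetch_add(s, 1);		/* even: done with the old ruleset */
	return NULL;
}

/* The replacer's side of the patch: wait until every counter is even,
 * i.e. no reader can still hold a reference to the old table. */
static void wait_until_all_even(void)
{
	for (int i = 0; i < NREADERS; i++)
		while (atomic_load(&seq[i]) & 1)
			sched_yield();	/* stands in for cpu_relax()/cond_resched() */
}

int run_demo(void)
{
	pthread_t tid[NREADERS];
	int ok = 1;

	for (int i = 0; i < NREADERS; i++)
		pthread_create(&tid[i], NULL, reader, &seq[i]);

	wait_until_all_even();	/* from here on, old counters are stable */

	for (int i = 0; i < NREADERS; i++)
		pthread_join(tid[i], NULL);
	for (int i = 0; i < NREADERS; i++)
		if (atomic_load(&seq[i]) & 1)
			ok = 0;
	return ok;
}
```

Note that a counter which is still even when the replacer checks it may belong to a reader that has not entered yet -- but such a reader necessarily sees the new table pointer, which is exactly the property the kernel patch relies on.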
[PATCH v2 nf-next 2/2] netfilter: x_tables: don't use seqlock when fetching old counters
after previous commit xt_replace_table will wait until all cpus had even seqcount (i.e., no cpu is accessing old ruleset). Add a 'old' counter retrival version that doesn't synchronize counters. Its not needed, the old counters are not in use anymore at this point. This speeds up table replacement on busy systems with large tables (and many cores). Cc: Dan Williams Cc: Eric Dumazet Signed-off-by: Florian Westphal --- v2: fix Erics email address net/ipv4/netfilter/arp_tables.c | 22 -- net/ipv4/netfilter/ip_tables.c | 23 +-- net/ipv6/netfilter/ip6_tables.c | 22 -- 3 files changed, 61 insertions(+), 6 deletions(-) diff --git a/net/ipv4/netfilter/arp_tables.c b/net/ipv4/netfilter/arp_tables.c index 9e2770fd00be..f88221aebc9d 100644 --- a/net/ipv4/netfilter/arp_tables.c +++ b/net/ipv4/netfilter/arp_tables.c @@ -634,6 +634,25 @@ static void get_counters(const struct xt_table_info *t, } } +static void get_old_counters(const struct xt_table_info *t, +struct xt_counters counters[]) +{ + struct arpt_entry *iter; + unsigned int cpu, i; + + for_each_possible_cpu(cpu) { + i = 0; + xt_entry_foreach(iter, t->entries, t->size) { + struct xt_counters *tmp; + + tmp = xt_get_per_cpu_counter(&iter->counters, cpu); + ADD_COUNTER(counters[i], tmp->bcnt, tmp->pcnt); + ++i; + } + cond_resched(); + } +} + static struct xt_counters *alloc_counters(const struct xt_table *table) { unsigned int countersize; @@ -910,8 +929,7 @@ static int __do_replace(struct net *net, const char *name, (newinfo->number <= oldinfo->initial_entries)) module_put(t->me); - /* Get the old counters, and synchronize with replace */ - get_counters(oldinfo, counters); + get_old_counters(oldinfo, counters); /* Decrease module usage counts and free resource */ loc_cpu_old_entry = oldinfo->entries; diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c index 39286e543ee6..4cbe5e80f3bf 100644 --- a/net/ipv4/netfilter/ip_tables.c +++ b/net/ipv4/netfilter/ip_tables.c @@ -781,6 +781,26 @@ 
get_counters(const struct xt_table_info *t, } } +static void get_old_counters(const struct xt_table_info *t, +struct xt_counters counters[]) +{ + struct ipt_entry *iter; + unsigned int cpu, i; + + for_each_possible_cpu(cpu) { + i = 0; + xt_entry_foreach(iter, t->entries, t->size) { + const struct xt_counters *tmp; + + tmp = xt_get_per_cpu_counter(&iter->counters, cpu); + ADD_COUNTER(counters[i], tmp->bcnt, tmp->pcnt); + ++i; /* macro does multi eval of i */ + } + + cond_resched(); + } +} + static struct xt_counters *alloc_counters(const struct xt_table *table) { unsigned int countersize; @@ -1070,8 +1090,7 @@ __do_replace(struct net *net, const char *name, unsigned int valid_hooks, (newinfo->number <= oldinfo->initial_entries)) module_put(t->me); - /* Get the old counters, and synchronize with replace */ - get_counters(oldinfo, counters); + get_old_counters(oldinfo, counters); /* Decrease module usage counts and free resource */ xt_entry_foreach(iter, oldinfo->entries, oldinfo->size) diff --git a/net/ipv6/netfilter/ip6_tables.c b/net/ipv6/netfilter/ip6_tables.c index 01bd3ee5ebc6..f06e25065a34 100644 --- a/net/ipv6/netfilter/ip6_tables.c +++ b/net/ipv6/netfilter/ip6_tables.c @@ -800,6 +800,25 @@ get_counters(const struct xt_table_info *t, } } +static void get_old_counters(const struct xt_table_info *t, +struct xt_counters counters[]) +{ + struct ip6t_entry *iter; + unsigned int cpu, i; + + for_each_possible_cpu(cpu) { + i = 0; + xt_entry_foreach(iter, t->entries, t->size) { + const struct xt_counters *tmp; + + tmp = xt_get_per_cpu_counter(&iter->counters, cpu); + ADD_COUNTER(counters[i], tmp->bcnt, tmp->pcnt); + ++i; + } + cond_resched(); + } +} + static struct xt_counters *alloc_counters(const struct xt_table *table) { unsigned int countersize; @@ -1090,8 +1109,7 @@ __do_replace(struct net *net, const char *name, unsigned int valid_hooks, (newinfo->number <= oldinfo->initial_entries)) module_put(t->me); - /* Get the old counters, and synchronize with replace */ - 
get_counters(oldinfo, counters); + get_old_counters(oldinfo, counters); /* Decrease module usage cou
[PATCH v2 nf-next] netfilter: x_tables: speed up iptables-restore
iptables-restore can take quite a long time when the system is busy, on the order of half a minute or more. The main reason for this is the way ip(6)tables performs the table swap, or, more precisely, the expensive sequence lock synchronizations when reading counters. When xt_replace_table assigns the new ruleset pointer, it does not wait for other processors to finish with the old ruleset. Instead it relies on the counter sequence lock in get_counters() to do this. This works, but it is very costly if the system is busy, as each counter read operation can be restarted indefinitely. Instead, make xt_replace_table wait until all processors are known to no longer use the old ruleset. This allows reading the old counters without any locking: no cpu is using the ruleset anymore, so the counters can't change either. ipv4/netfilter/arp_tables.c | 22 -- ipv4/netfilter/ip_tables.c | 23 +-- ipv6/netfilter/ip6_tables.c | 22 -- netfilter/x_tables.c| 15 --- 4 files changed, 73 insertions(+), 9 deletions(-)
Re: [Patch net-next] tcp: add a tracepoint for tcp_retransmit_skb()
On Tue, Oct 10, 2017 at 10:38 AM, Alexei Starovoitov wrote: > > I'm happy to see new tracepoints being added to tcp stack, but I'm concerned > with practical usability of them. > Like the above tracepoint definition makes it not very useful from bpf point > of view, > since 'sk' pointer is not recored by as part of the tracepoint. > In bpf/tracing world we prefer tracepoints to have raw pointers recorded > in TP_STRUCT__entry() and _not_ printed in TP_printk() > (since pointers are useless for userspace). > Like trace_kfree_skb() tracepoint records raw 'skb' pointer and we can > walk whatever sk_buff fields we need inside the program. > Such approach allows tracepoint to be usable in many more scenarios, since > bpf program can examine kernel datastructures. Sure, I am happy to add them for BPF. The current version is merely for our own use case, other use cases like this are always welcome! > Over the last few years we've been running tcp statistics framework (similar > to web10g) > using 8 kprobes in tcp stack with bpf programs extracting the data and now > we're > ready to share this experience with the community. Right now we're working on > a set > of tracepoints for tcp stack to make the interface more accurate, faster and > more stable. > We're planning to send an RFC patch with these new tracepoints in the comming > weeks. Great! Looking forward to it! > > More concrete, if you can make this trace_tcp_retransmit_skb() to record > sk, skb pointers and err code at the end of __tcp_retransmit_skb() it will > solve > our need as well. Note, currently I only call trace_tcp_retransmit_skb() for successful retransmissions, since you mentioned err code, I guess you want it for failures too? I am not sure if tracing unsuccessful TCP retransmissions is meaningful here, I guess it's needed for BPF to track TCP states? It doesn't harm to add it, at least we can filter out err!=0 since we only care about successful ones. 
> > So far our list of kprobes is: > int kprobe__tcp_validate_incoming > int kprobe__tcp_send_active_reset > int kprobe__tcp_v4_send_reset > int kprobe__tcp_v6_send_reset > int kprobe__tcp_v4_destroy_sock > int kprobe__tcp_set_state > int kprobe__tcp_retransmit_skb > int kprobe__tcp_rtx_synack > > with tracepoints we can consolidate two of them into one and drop > another one for sure. Notice that tcp_retransmit_skb is on our list too > and currently we're doing extra work inside the program to make it more > accurate which will be unnecessary if this tracepoint is at the end > of __tcp_retransmit_skb(). Yeah, with these tracepoints we would be able to trace more TCP state changes. Thanks!
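A tracepoint matching Alexei's preference -- raw sk/skb pointers recorded in TP_STRUCT__entry() so BPF programs can walk the structures, with only a minimal TP_printk() -- might look roughly like the following. This is a hypothetical sketch, not the tracepoint that was eventually merged; the field names and the err argument are assumptions:

```c
/* Hypothetical sketch of a BPF-friendly tracepoint definition. */
TRACE_EVENT(tcp_retransmit_skb,

	TP_PROTO(struct sock *sk, struct sk_buff *skb, int err),

	TP_ARGS(sk, skb, err),

	TP_STRUCT__entry(
		__field(const void *, skaddr)	/* raw pointers: a BPF program */
		__field(const void *, skbaddr)	/* can walk these structures  */
		__field(int, err)		/* __tcp_retransmit_skb() result */
	),

	TP_fast_assign(
		__entry->skaddr = sk;
		__entry->skbaddr = skb;
		__entry->err = err;
	),

	/* Pointers are useless to userspace, so keep the printk minimal. */
	TP_printk("sk=%p skb=%p err=%d",
		  __entry->skaddr, __entry->skbaddr, __entry->err)
);
```

Placed at the end of __tcp_retransmit_skb() with the error code recorded, a hook of this shape would cover both the successful and the failed case, and a consumer could still filter on err != 0 as Cong suggests.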
Re: [PATCH v2] xdp: Sample xdp program implementing ip forward
On 10/10/2017 10:19 AM, Stephen Hemminger wrote: On Tue, 10 Oct 2017 12:58:52 +0530 Christina Jacob wrote: +/* Get the mac address of the interface given interface name */ +static long *getmac(char *iface) +{ + int fd; + struct ifreq ifr; + long *mac = NULL; + + fd = socket(AF_INET, SOCK_DGRAM, 0); + ifr.ifr_addr.sa_family = AF_INET; + strncpy(ifr.ifr_name, iface, IFNAMSIZ - 1); + ioctl(fd, SIOCGIFHWADDR, &ifr); + mac = (long *)ifr.ifr_hwaddr.sa_data; + close(fd); + return mac; Always check the return value of ioctl. You are assuming sizeof(long) > 6 bytes. Also the byte order. Also: Returning the address of a local variable (ifr.ifr_hwaddr.sa_data), and then dereferencing it outside of the function, is not correct. The cast of the char sa_data[] to a long * may cause alignment faults on some architectures. There may also be endianness issues depending on how the data are manipulated if you pack all those chars into a long. If we think of a MAC address as char[6], then it may be best to define the data structures as such and manipulate it as an array instead of trying to pack it into a long. Keep working on this though, this program will surely be useful. David Daney
Re: [PATCH net-next v2 6/7] bpf: don't rely on the verifier lock for metadata_dst allocation
Hi Jakub, [auto build test WARNING on net-next/master] url: https://github.com/0day-ci/linux/commits/Jakub-Kicinski/bpf-get-rid-of-global-verifier-state-and-reuse-instruction-printer/20171011-021905 config: x86_64-randconfig-a0-10110234 (attached as .config) compiler: gcc-4.4 (Debian 4.4.7-8) 4.4.7 reproduce: # save the attached .config to linux build tree make ARCH=x86_64 All warnings (new ones prefixed by >>): net//core/dst.c: In function 'metadata_dst_free_percpu': >> net//core/dst.c:328: warning: unused variable 'cpu' vim +/cpu +328 net//core/dst.c 325 326 void metadata_dst_free_percpu(struct metadata_dst __percpu *md_dst) 327 { > 328 int cpu; 329 --- 0-DAY kernel test infrastructure Open Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation
Re: [PATCH net-next v2 4/5] selinux: bpf: Add selinux check for eBPF syscall operations
Hi Chenbo, [auto build test WARNING on net-next/master] url: https://github.com/0day-ci/linux/commits/Chenbo-Feng/bpf-security-New-file-mode-and-LSM-hooks-for-eBPF-object-permission-control/20171011-010349 config: x86_64-randconfig-u0-10110310 (attached as .config) compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901 reproduce: # save the attached .config to linux build tree make ARCH=x86_64 All warnings (new ones prefixed by >>): In file included from include/linux/init.h:4:0, from security/selinux/hooks.c:27: security/selinux/hooks.c: In function 'bpf_map_fmode_to_av': security/selinux/hooks.c:6284:6: error: 'f_mode' undeclared (first use in this function) if (f_mode & FMODE_READ) ^ include/linux/compiler.h:156:30: note: in definition of macro '__trace_if' if (__builtin_constant_p(!!(cond)) ? !!(cond) : \ ^~~~ >> security/selinux/hooks.c:6284:2: note: in expansion of macro 'if' if (f_mode & FMODE_READ) ^~ security/selinux/hooks.c:6284:6: note: each undeclared identifier is reported only once for each function it appears in if (f_mode & FMODE_READ) ^ include/linux/compiler.h:156:30: note: in definition of macro '__trace_if' if (__builtin_constant_p(!!(cond)) ? !!(cond) : \ ^~~~ >> security/selinux/hooks.c:6284:2: note: in expansion of macro 'if' if (f_mode & FMODE_READ) ^~ vim +/if +6284 security/selinux/hooks.c 6279 6280 static u32 bpf_map_fmode_to_av(fmode_t fmode) 6281 { 6282 u32 av = 0; 6283 > 6284 if (f_mode & FMODE_READ) 6285 av |= BPF_MAP__READ; 6286 if (f_mode & FMODE_WRITE) 6287 av |= BPF_MAP__WRITE; 6288 return av; 6289 } 6290 --- 0-DAY kernel test infrastructureOpen Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation .config.gz Description: application/gzip
Re: [patch net-next 1/4] net: sched: make tc_action_ops->get_dev return dev and avoid passing net
Tue, Oct 10, 2017 at 07:44:53PM CEST, xiyou.wangc...@gmail.com wrote: >On Tue, Oct 10, 2017 at 12:30 AM, Jiri Pirko wrote: >> -static int tcf_mirred_device(const struct tc_action *a, struct net *net, >> -struct net_device **mirred_dev) >> +static struct net_device *tcf_mirred_get_dev(const struct tc_action *a) >> { >> - int ifindex = tcf_mirred_ifindex(a); >> + struct tcf_mirred *m = to_mirred(a); >> >> - *mirred_dev = __dev_get_by_index(net, ifindex); >> - if (!*mirred_dev) >> - return -EINVAL; >> - return 0; >> + return __dev_get_by_index(m->net, m->tcfm_ifindex); > >Hmm, why not just return m->tcfm_dev? I just followed the existing code. The change you suggest should be a separate follow-up patch.
Re: [PATCH net-next] cxgb4: Add support for new flash parts
From: Ganesh Goudar Date: Tue, 10 Oct 2017 12:44:13 +0530 > Add support for new flash parts identification, and > also cleanup the flash Part identifying and decoding > code. > > Based on the original work of Casey Leedom > > Signed-off-by: Ganesh Goudar Applied.
Re: [PATCH net-next] cxgb4: add new T5 pci device id's
From: Ganesh Goudar Date: Tue, 10 Oct 2017 12:45:02 +0530 > Add 0x50aa and 0x50ab T5 device id's. > > Signed-off-by: Ganesh Goudar Applied.
Re: [patch net-next 3/4] net: sched: convert cls_flower->egress_dev users to tc_setup_cb_egdev infra
Tue, Oct 10, 2017 at 10:08:23PM CEST, gerlitz...@gmail.com wrote: >On Tue, Oct 10, 2017 at 10:30 AM, Jiri Pirko wrote: >> The only user of cls_flower->egress_dev is mlx5. > >but nfp supports decap action offload too and from the flower code >stand point, I guess they are both the same, right? how does it work >there? Apparently they don't use cls_flower->egress_dev.
Re: [patch net-next 0/4] net: sched: get rid of cls_flower->egress_dev
Tue, Oct 10, 2017 at 07:24:21PM CEST, gerlitz...@gmail.com wrote: >Jiri, > >FWIW, as I reported to you earlier, I was playing with tc encap/decap rules >on 4.14-rc+ (net) before >applying any patch of this series, and something is messy w.r.t. decap >rules. I don't see >them removed at all when user space attempts to do so. It might (probably) >be an mlx5 bug, which >we will have to fix for net and later rebase net-next over net. We have >short WW here so >I will not be able to do RCA this week. Or, as I replied to you earlier, the issue you describe is totally unrelated to this patchset as you see the issue with the current net-next. Not sure why you add this comment here.
[PATCH] rtl8xxxu: mark expected switch fall-throughs
In preparation to enabling -Wimplicit-fallthrough, mark switch cases where we are expecting to fall through. Cc: Jes Sorensen Cc: Kalle Valo Cc: linux-wirel...@vger.kernel.org Cc: netdev@vger.kernel.org Signed-off-by: Gustavo A. R. Silva --- drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_core.c | 5 + 1 file changed, 5 insertions(+) diff --git a/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_core.c b/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_core.c index 7806a4d..e66be05 100644 --- a/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_core.c +++ b/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_core.c @@ -1153,6 +1153,7 @@ void rtl8xxxu_gen1_config_channel(struct ieee80211_hw *hw) switch (hw->conf.chandef.width) { case NL80211_CHAN_WIDTH_20_NOHT: ht = false; + /* fall through */ case NL80211_CHAN_WIDTH_20: opmode |= BW_OPMODE_20MHZ; rtl8xxxu_write8(priv, REG_BW_OPMODE, opmode); @@ -1280,6 +1281,7 @@ void rtl8xxxu_gen2_config_channel(struct ieee80211_hw *hw) switch (hw->conf.chandef.width) { case NL80211_CHAN_WIDTH_20_NOHT: ht = false; + /* fall through */ case NL80211_CHAN_WIDTH_20: rf_mode_bw |= WMAC_TRXPTCL_CTL_BW_20; subchannel = 0; @@ -1748,9 +1750,11 @@ static int rtl8xxxu_identify_chip(struct rtl8xxxu_priv *priv) case 3: priv->ep_tx_low_queue = 1; priv->ep_tx_count++; + /* fall through */ case 2: priv->ep_tx_normal_queue = 1; priv->ep_tx_count++; + /* fall through */ case 1: priv->ep_tx_high_queue = 1; priv->ep_tx_count++; @@ -5691,6 +5695,7 @@ static int rtl8xxxu_set_key(struct ieee80211_hw *hw, enum set_key_cmd cmd, break; case WLAN_CIPHER_SUITE_TKIP: key->flags |= IEEE80211_KEY_FLAG_GENERATE_MMIC; + /* fall through */ default: return -EOPNOTSUPP; } -- 2.7.4
Re: [PATCH] rtl8xxxu: mark expected switch fall-throughs
On 10/10/2017 12:35 PM, Jes Sorensen wrote: > On 10/10/2017 03:30 PM, Gustavo A. R. Silva wrote: >> In preparation to enabling -Wimplicit-fallthrough, mark switch cases >> where we are expecting to fall through. > > While this isn't harmful, to me this looks like pointless patch churn > for zero gain and it's just ugly. That is the canonical way to tell static analyzers and compilers that fall throughs are wanted and not accidental mistakes in the code. For people that deal with these kinds of errors, it's quite helpful, unless you suggest disabling that particular GCC warning specific for that file/directory? > > Jes > > >> Cc: Jes Sorensen >> Cc: Kalle Valo >> Cc: linux-wirel...@vger.kernel.org >> Cc: netdev@vger.kernel.org >> Signed-off-by: Gustavo A. R. Silva >> --- >> drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_core.c | 5 + >> 1 file changed, 5 insertions(+) >> >> diff --git a/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_core.c >> b/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_core.c >> index 7806a4d..e66be05 100644 >> --- a/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_core.c >> +++ b/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_core.c >> @@ -1153,6 +1153,7 @@ void rtl8xxxu_gen1_config_channel(struct >> ieee80211_hw *hw) >> switch (hw->conf.chandef.width) { >> case NL80211_CHAN_WIDTH_20_NOHT: >> ht = false; >> +/* fall through */ >> case NL80211_CHAN_WIDTH_20: >> opmode |= BW_OPMODE_20MHZ; >> rtl8xxxu_write8(priv, REG_BW_OPMODE, opmode); >> @@ -1280,6 +1281,7 @@ void rtl8xxxu_gen2_config_channel(struct >> ieee80211_hw *hw) >> switch (hw->conf.chandef.width) { >> case NL80211_CHAN_WIDTH_20_NOHT: >> ht = false; >> +/* fall through */ >> case NL80211_CHAN_WIDTH_20: >> rf_mode_bw |= WMAC_TRXPTCL_CTL_BW_20; >> subchannel = 0; >> @@ -1748,9 +1750,11 @@ static int rtl8xxxu_identify_chip(struct >> rtl8xxxu_priv *priv) >> case 3: >> priv->ep_tx_low_queue = 1; >> priv->ep_tx_count++; >> +/* fall through */ >> case 2: >> priv->ep_tx_normal_queue = 1; >> 
priv->ep_tx_count++; >> +/* fall through */ >> case 1: >> priv->ep_tx_high_queue = 1; >> priv->ep_tx_count++; >> @@ -5691,6 +5695,7 @@ static int rtl8xxxu_set_key(struct ieee80211_hw >> *hw, enum set_key_cmd cmd, >> break; >> case WLAN_CIPHER_SUITE_TKIP: >> key->flags |= IEEE80211_KEY_FLAG_GENERATE_MMIC; >> +/* fall through */ >> default: >> return -EOPNOTSUPP; >> } >> > -- Florian
[PATCH net-next v2] openvswitch: add ct_clear action
This adds a ct_clear action for clearing conntrack state. ct_clear is
currently implemented in OVS userspace, but is not backed by an action
in the kernel datapath. This is useful for flows that may modify a
packet tuple after a ct lookup has already occurred.

Signed-off-by: Eric Garver
---
v2:
  - Use IP_CT_UNTRACKED for nf_ct_set()
  - Only fill key if previously conntracked

 include/uapi/linux/openvswitch.h |  2 ++
 net/openvswitch/actions.c        |  4 ++++
 net/openvswitch/conntrack.c      | 11 +++++++++++
 net/openvswitch/conntrack.h      |  7 +++++++
 net/openvswitch/flow_netlink.c   |  5 +++++
 5 files changed, 29 insertions(+)

diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index efdbfbfd3ee2..0cd6f8833147 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -807,6 +807,7 @@ struct ovs_action_push_eth {
  * packet.
  * @OVS_ACTION_ATTR_POP_ETH: Pop the outermost Ethernet header off the
  * packet.
+ * @OVS_ACTION_ATTR_CT_CLEAR: Clear conntrack state from the packet.
  *
  * Only a single header can be set with a single %OVS_ACTION_ATTR_SET. Not all
  * fields within a header are modifiable, e.g. the IPv4 protocol and fragment
@@ -836,6 +837,7 @@ enum ovs_action_attr {
 	OVS_ACTION_ATTR_TRUNC,    /* u32 struct ovs_action_trunc. */
 	OVS_ACTION_ATTR_PUSH_ETH, /* struct ovs_action_push_eth. */
 	OVS_ACTION_ATTR_POP_ETH,  /* No argument. */
+	OVS_ACTION_ATTR_CT_CLEAR, /* No argument. */

 	__OVS_ACTION_ATTR_MAX,    /* Nothing past this will be accepted
 				   * from userspace. */
diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index a54a556fcdb5..a551232daf61 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -1203,6 +1203,10 @@ static int do_execute_actions(struct datapath *dp, struct sk_buff *skb,
 		return err == -EINPROGRESS ? 0 : err;
 		break;

+	case OVS_ACTION_ATTR_CT_CLEAR:
+		err = ovs_ct_clear(skb, key);
+		break;
+
 	case OVS_ACTION_ATTR_PUSH_ETH:
 		err = push_eth(skb, key, nla_data(a));
 		break;
diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
index d558e882ca0c..fe861e2f0deb 100644
--- a/net/openvswitch/conntrack.c
+++ b/net/openvswitch/conntrack.c
@@ -1129,6 +1129,17 @@ int ovs_ct_execute(struct net *net, struct sk_buff *skb,
 	return err;
 }

+int ovs_ct_clear(struct sk_buff *skb, struct sw_flow_key *key)
+{
+	if (skb_nfct(skb)) {
+		nf_conntrack_put(skb_nfct(skb));
+		nf_ct_set(skb, NULL, IP_CT_UNTRACKED);
+		ovs_ct_fill_key(skb, key);
+	}
+
+	return 0;
+}
+
 static int ovs_ct_add_helper(struct ovs_conntrack_info *info, const char *name,
 			     const struct sw_flow_key *key, bool log)
 {
diff --git a/net/openvswitch/conntrack.h b/net/openvswitch/conntrack.h
index bc7efd1867ab..399dfdd2c4f9 100644
--- a/net/openvswitch/conntrack.h
+++ b/net/openvswitch/conntrack.h
@@ -30,6 +30,7 @@ int ovs_ct_action_to_attr(const struct ovs_conntrack_info *, struct sk_buff *);
 int ovs_ct_execute(struct net *, struct sk_buff *, struct sw_flow_key *,
 		   const struct ovs_conntrack_info *);
+int ovs_ct_clear(struct sk_buff *skb, struct sw_flow_key *key);
 void ovs_ct_fill_key(const struct sk_buff *skb, struct sw_flow_key *key);
 int ovs_ct_put_key(const struct sw_flow_key *swkey,
@@ -73,6 +74,12 @@ static inline int ovs_ct_execute(struct net *net, struct sk_buff *skb,
 	return -ENOTSUPP;
 }

+static inline int ovs_ct_clear(struct sk_buff *skb,
+			       struct sw_flow_key *key)
+{
+	return -ENOTSUPP;
+}
+
 static inline void ovs_ct_fill_key(const struct sk_buff *skb,
 				   struct sw_flow_key *key)
 {
diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index fc0ca9a89b8e..dc0d79092e74 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -76,6 +76,7 @@ static bool actions_may_change_flow(const struct nlattr *actions)
 		break;

 	case OVS_ACTION_ATTR_CT:
+	case OVS_ACTION_ATTR_CT_CLEAR:
 	case OVS_ACTION_ATTR_HASH:
 	case OVS_ACTION_ATTR_POP_ETH:
 	case OVS_ACTION_ATTR_POP_MPLS:
@@ -2528,6 +2529,7 @@ static int __ovs_nla_copy_actions(struct net *net, const struct nlattr *attr,
 			[OVS_ACTION_ATTR_SAMPLE] = (u32)-1,
 			[OVS_ACTION_ATTR_HASH] = sizeof(struct ovs_action_hash),
 			[OVS_ACTION_ATTR_CT] = (u32)-1,
+			[OVS_ACTION_ATTR_CT_CLEAR] = 0,
 			[OVS_ACTION_ATTR_TRUNC] = sizeof(struct ovs_action_tr
Re: [ovs-dev] [PATCH net-next] openvswitch: add ct_clear action
On 10 October 2017 at 12:13, Eric Garver wrote:
> On Tue, Oct 10, 2017 at 10:24:20AM -0700, Joe Stringer wrote:
>> On 10 October 2017 at 08:09, Eric Garver wrote:
>> > On Tue, Oct 10, 2017 at 05:33:48AM -0700, Joe Stringer wrote:
>> >> On 9 October 2017 at 21:41, Pravin Shelar wrote:
>> >> > On Fri, Oct 6, 2017 at 9:44 AM, Eric Garver wrote:
>> >> >> This adds a ct_clear action for clearing conntrack state. ct_clear is
>> >> >> currently implemented in OVS userspace, but is not backed by an action
>> >> >> in the kernel datapath. This is useful for flows that may modify a
>> >> >> packet tuple after a ct lookup has already occurred.
>> >> >>
>> >> >> Signed-off-by: Eric Garver
>> >> > Patch mostly looks good. I have following comments.
>> >> >
>> >> >> ---
>> >> >>  include/uapi/linux/openvswitch.h |  2 ++
>> >> >>  net/openvswitch/actions.c        |  5 +++++
>> >> >>  net/openvswitch/conntrack.c      | 12 ++++++++++++
>> >> >>  net/openvswitch/conntrack.h      |  7 +++++++
>> >> >>  net/openvswitch/flow_netlink.c   |  5 +++++
>> >> >>  5 files changed, 31 insertions(+)
>> >> >>
>> >> >> diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
>> >> >> index 156ee4cab82e..1b6e510e2cc6 100644
>> >> >> --- a/include/uapi/linux/openvswitch.h
>> >> >> +++ b/include/uapi/linux/openvswitch.h
>> >> >> @@ -806,6 +806,7 @@ struct ovs_action_push_eth {
>> >> >>   * packet.
>> >> >>   * @OVS_ACTION_ATTR_POP_ETH: Pop the outermost Ethernet header off the
>> >> >>   * packet.
>> >> >> + * @OVS_ACTION_ATTR_CT_CLEAR: Clear conntrack state from the packet.
>> >> >>   *
>> >> >>   * Only a single header can be set with a single %OVS_ACTION_ATTR_SET. Not all
>> >> >>   * fields within a header are modifiable, e.g. the IPv4 protocol and fragment
>> >> >> @@ -835,6 +836,7 @@ enum ovs_action_attr {
>> >> >>  	OVS_ACTION_ATTR_TRUNC,    /* u32 struct ovs_action_trunc. */
>> >> >>  	OVS_ACTION_ATTR_PUSH_ETH, /* struct ovs_action_push_eth. */
>> >> >>  	OVS_ACTION_ATTR_POP_ETH,  /* No argument. */
>> >> >> +	OVS_ACTION_ATTR_CT_CLEAR, /* No argument. */
>> >> >>
>> >> >>  	__OVS_ACTION_ATTR_MAX,    /* Nothing past this will be accepted
>> >> >>  				   * from userspace. */
>> >> >> diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
>> >> >> index a54a556fcdb5..db9c7f2e662b 100644
>> >> >> --- a/net/openvswitch/actions.c
>> >> >> +++ b/net/openvswitch/actions.c
>> >> >> @@ -1203,6 +1203,10 @@ static int do_execute_actions(struct datapath *dp, struct sk_buff *skb,
>> >> >>  		return err == -EINPROGRESS ? 0 : err;
>> >> >>  		break;
>> >> >>
>> >> >> +	case OVS_ACTION_ATTR_CT_CLEAR:
>> >> >> +		err = ovs_ct_clear(skb, key);
>> >> >> +		break;
>> >> >> +
>> >> >>  	case OVS_ACTION_ATTR_PUSH_ETH:
>> >> >>  		err = push_eth(skb, key, nla_data(a));
>> >> >>  		break;
>> >> >> @@ -1210,6 +1214,7 @@ static int do_execute_actions(struct datapath *dp, struct sk_buff *skb,
>> >> >>  	case OVS_ACTION_ATTR_POP_ETH:
>> >> >>  		err = pop_eth(skb, key);
>> >> >>  		break;
>> >> >> +
>> >> >>  	}
>> >> > Unrelated change.
>> >> >
>> >> >>
>> >> >>  	if (unlikely(err)) {
>> >> >> diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
>> >> >> index d558e882ca0c..f9b73c726ad7 100644
>> >> >> --- a/net/openvswitch/conntrack.c
>> >> >> +++ b/net/openvswitch/conntrack.c
>> >> >> @@ -1129,6 +1129,18 @@ int ovs_ct_execute(struct net *net, struct sk_buff *skb,
>> >> >>  	return err;
>> >> >>  }
>> >> >>
>> >> >> +int ovs_ct_clear(struct sk_buff *skb, struct sw_flow_key *key)
>> >> >> +{
>> >> >> +	if (skb_nfct(skb)) {
>> >> >> +		nf_conntrack_put(skb_nfct(skb));
>> >> >> +		nf_ct_set(skb, NULL, 0);
>> >> > Can the new conntract state be appropriate? may be IP_CT_UNTRACKED?
>> >> >
>> >> >> +	}
>> >> >> +
>> >> >> +	ovs_ct_fill_key(skb, key);
>> >> >> +
>> >> > I do not see need to refill the key if there is no skb-nf-ct.
>> >>
>> >> Really this is trying to just zero the CT key fields, but reuses
>> >> existing functions, right? This means that subsequent upcalls, for
>> >
>> > Right.
>> >
>> >> instance, won't have the outdated view of the CT state from the
>> >> previous lookup (that was prior to the ct_clear). I'd expect these key
>> >> fields to be cleared.
>> >
>> > I assumed Pravin was saying that we don't need to clear them if there is
>> > no conntrack state. They should already be zero.
>>
>> The conntrack calls aren't going to clear it, so I don't see what else
>> would clear it?
>>
>> If you execute ct(),ct_clear(), then the first ct wi