Re: [PATCH 4/7] devcg: Added rdma resource tracker object per task
On 07/09/2015 23:38, Parav Pandit wrote:
> @@ -2676,7 +2686,7 @@ static inline int thread_group_empty(struct task_struct *p)
>   * Protects ->fs, ->files, ->mm, ->group_info, ->comm, keyring
>   * subscriptions and synchronises with wait4(). Also used in procfs. Also
>   * pins the final release of task.io_context. Also protects ->cpuset and
> - * ->cgroup.subsys[]. And ->vfork_done.
> + * ->cgroup.subsys[]. Also projtects ->vfork_done and ->rdma_res_counter.

s/projtects/protects/

>   *
>   * Nests both inside and outside of read_lock(&tasklist_lock).
>   * It must not be nested with write_lock_irq(&tasklist_lock),

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/7] devcg: Added infrastructure for rdma device cgroup.
On 07/09/2015 23:38, Parav Pandit wrote:
> diff --git a/include/linux/device_cgroup.h b/include/linux/device_cgroup.h
> index 8b64221..cdbdd60 100644
> --- a/include/linux/device_cgroup.h
> +++ b/include/linux/device_cgroup.h
> @@ -1,6 +1,57 @@
> +#ifndef _DEVICE_CGROUP
> +#define _DEVICE_CGROUP
> +
>  #include
> +#include
> +#include

You cannot add this include line before adding the device_rdma_cgroup.h
(added in patch 5). You should reorder the patches so that after each
patch the kernel builds correctly.

I also noticed in patch 2 you add device_rdma_cgroup.o to the Makefile
before it was added to the kernel.

Regards,
Haggai
[PATCH for-next 5/5] iw_cxgb4: set the default MPA version to 2
This enables ORD/IRD negotiation, and it's about time to enable it by
default.

Signed-off-by: Hariprasad Shenai
---
 drivers/infiniband/hw/cxgb4/cm.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/cm.c b/drivers/infiniband/hw/cxgb4/cm.c
index 0e2741b..79d6855 100644
--- a/drivers/infiniband/hw/cxgb4/cm.c
+++ b/drivers/infiniband/hw/cxgb4/cm.c
@@ -115,11 +115,11 @@ module_param(ep_timeout_secs, int, 0644);
 MODULE_PARM_DESC(ep_timeout_secs, "CM Endpoint operation timeout "
 				   "in seconds (default=60)");
 
-static int mpa_rev = 1;
+static int mpa_rev = 2;
 module_param(mpa_rev, int, 0644);
 MODULE_PARM_DESC(mpa_rev, "MPA Revision, 0 supports amso1100, "
 		"1 is RFC0544 spec compliant, 2 is IETF MPA Peer Connect Draft"
-		" compliant (default=1)");
+		" compliant (default=2)");
 
 static int markers_enabled;
 module_param(markers_enabled, int, 0644);
-- 
2.3.4
[PATCH for-next 4/5] iw_cxgb4: reverse the ord/ird in the ESTABLISHED upcall
The ESTABLISHED event should have the peer's ord/ird, so swap the values
in the event before the upcall.

Signed-off-by: Hariprasad Shenai
---
 drivers/infiniband/hw/cxgb4/cm.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/cm.c b/drivers/infiniband/hw/cxgb4/cm.c
index 06d208c..0e2741b 100644
--- a/drivers/infiniband/hw/cxgb4/cm.c
+++ b/drivers/infiniband/hw/cxgb4/cm.c
@@ -1242,8 +1242,8 @@ static void established_upcall(struct c4iw_ep *ep)
 	PDBG("%s ep %p tid %u\n", __func__, ep, ep->hwtid);
 	memset(&event, 0, sizeof(event));
 	event.event = IW_CM_EVENT_ESTABLISHED;
-	event.ird = ep->ird;
-	event.ord = ep->ord;
+	event.ird = ep->ord;
+	event.ord = ep->ird;
 	if (ep->com.cm_id) {
 		PDBG("%s ep %p tid %u\n", __func__, ep, ep->hwtid);
 		ep->com.cm_id->event_handler(ep->com.cm_id, &event);
-- 
2.3.4
[PATCH for-next 1/5] iw_cxgb4: detect fatal errors while creating listening filters
In c4iw_create_listen(), if we're using listen filters, then bail out of
the busy loop if the device becomes fatally dead.

Signed-off-by: Hariprasad Shenai
---
 drivers/infiniband/hw/cxgb4/cm.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/infiniband/hw/cxgb4/cm.c b/drivers/infiniband/hw/cxgb4/cm.c
index 3ad8dc7..42087c9 100644
--- a/drivers/infiniband/hw/cxgb4/cm.c
+++ b/drivers/infiniband/hw/cxgb4/cm.c
@@ -3202,6 +3202,10 @@ static int create_server4(struct c4iw_dev *dev, struct c4iw_listen_ep *ep)
 			sin->sin_addr.s_addr, sin->sin_port, 0,
 			ep->com.dev->rdev.lldi.rxq_ids[0], 0, 0);
 		if (err == -EBUSY) {
+			if (c4iw_fatal_error(&ep->com.dev->rdev)) {
+				err = -EIO;
+				break;
+			}
 			set_current_state(TASK_UNINTERRUPTIBLE);
 			schedule_timeout(usecs_to_jiffies(100));
 		}
-- 
2.3.4
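Editorial illustration, not part of the posted patch: a minimal user-space
sketch of the retry pattern the hunk adjusts. It keeps retrying while the
create call reports -EBUSY, but gives up with -EIO once the device is
flagged fatal. `struct fake_dev`, `try_create()` and `create_with_retry()`
are hypothetical stand-ins rather than iw_cxgb4 functions, and the sleep
between retries is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the adapter state; not an iw_cxgb4 type. */
struct fake_dev {
	int busy_left;	/* how many more attempts will report -EBUSY */
	bool fatal;	/* set once the device has hit a fatal error */
	int attempts;	/* number of create attempts made so far */
};

/* Simulated filter creation: busy for a while, then succeeds. */
int try_create(struct fake_dev *dev)
{
	dev->attempts++;
	if (dev->busy_left > 0) {
		dev->busy_left--;
		return -EBUSY;
	}
	return 0;
}

/*
 * Retry while busy, but stop with -EIO if the device is fatally dead,
 * so the loop cannot spin forever on a device that will never recover.
 */
int create_with_retry(struct fake_dev *dev)
{
	int err;

	do {
		err = try_create(dev);
		if (err == -EBUSY && dev->fatal) {
			err = -EIO;
			break;
		}
	} while (err == -EBUSY);

	return err;
}
```

Without the fatal check, a device that stays busy forever would keep the
caller in the loop indefinitely, which is the situation the patch
description calls out.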
[PATCH for-next 3/5] iw_cxgb4: fix misuse of ep->ord for minimum ird calculation
When calculating the minimum ird in c4iw_accept_cr(), we need to always
have a value of at least 1 if the RTR message is a 0B read. The code
incorrectly used ep->ord for this check, which wrongly adjusted the ird
and caused incorrect ord/ird negotiation when MPAv2 is used to negotiate
these values.

Signed-off-by: Hariprasad Shenai
---
 drivers/infiniband/hw/cxgb4/cm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/infiniband/hw/cxgb4/cm.c b/drivers/infiniband/hw/cxgb4/cm.c
index a26d293..06d208c 100644
--- a/drivers/infiniband/hw/cxgb4/cm.c
+++ b/drivers/infiniband/hw/cxgb4/cm.c
@@ -2878,7 +2878,7 @@ int c4iw_accept_cr(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param)
 	} else {
 		if (peer2peer &&
 		    (ep->mpa_attr.p2p_type != FW_RI_INIT_P2PTYPE_DISABLED) &&
-		    (p2p_type == FW_RI_INIT_P2PTYPE_READ_REQ) && ep->ord == 0)
+		    (p2p_type == FW_RI_INIT_P2PTYPE_READ_REQ) && ep->ird == 0)
 			ep->ird = 1;
 	}
 
-- 
2.3.4
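Editorial illustration, not part of the posted patch: the rule described
above can be reduced to a small helper. `min_ird_for_rtr()` is a
hypothetical name, and the several peer2peer conditions from the hunk are
collapsed into one boolean for brevity.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper: return the IRD to use on the accepting side.
 * If the RTR message is a zero-byte RDMA read, the responder must accept
 * at least one inbound read, so an IRD of 0 is raised to 1. The test is
 * on the IRD itself, not on the ORD, which is the point of the fix.
 */
unsigned int min_ird_for_rtr(unsigned int ird, bool rtr_is_read)
{
	if (rtr_is_read && ird == 0)
		return 1;
	return ird;
}
```

Testing the ORD instead would leave an IRD of 0 untouched whenever the ORD
happened to be non-zero, which matches the misbehaviour the commit message
describes.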
[PATCH for-next 0/5] set MPA revision to 2 and misc. fixes for iw_cxgb4
Hi,

This patch series adds the following: detect errors while creating
listening servers, pass ird/ord info in connect reply events, fix misuse
of ord for ird calculation, and swap the ord/ird values in the ESTABLISHED
event before the upcall so that the event carries the peer's ord/ird. It
also sets the default MPA version to 2.

This patch series has been created against Doug's linux tree and includes
patches on the iw_cxgb4 driver.

We have included all the maintainers of the respective drivers. Kindly
review the changes and let us know in case of any review comments.

Thanks

Hariprasad Shenai (5):
  iw_cxgb4: detect fatal errors while creating listening filters
  iw_cxgb4: pass the ord/ird in connect reply events
  iw_cxgb4: fix misuse of ep->ord for minimum ird calculation
  iw_cxgb4: reverse the ord/ird in the ESTABLISHED upcall
  iw_cxgb4: set the default MPA version to 2

 drivers/infiniband/hw/cxgb4/cm.c | 18 +-
 1 file changed, 13 insertions(+), 5 deletions(-)

-- 
2.3.4
[PATCH for-next 2/5] iw_cxgb4: pass the ord/ird in connect reply events
This allows client ULPs to get the negotiated ord/ird, which is useful to
avoid stalling the SQ due to exceeding the ORD.

Signed-off-by: Hariprasad Shenai
---
 drivers/infiniband/hw/cxgb4/cm.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/infiniband/hw/cxgb4/cm.c b/drivers/infiniband/hw/cxgb4/cm.c
index 42087c9..a26d293 100644
--- a/drivers/infiniband/hw/cxgb4/cm.c
+++ b/drivers/infiniband/hw/cxgb4/cm.c
@@ -1169,6 +1169,8 @@ static void connect_reply_upcall(struct c4iw_ep *ep, int status)
 	if ((status == 0) || (status == -ECONNREFUSED)) {
 		if (!ep->tried_with_mpa_v1) {
 			/* this means MPA_v2 is used */
+			event.ord = ep->ird;
+			event.ird = ep->ord;
 			event.private_data_len = ep->plen -
 				sizeof(struct mpa_v2_conn_params);
 			event.private_data = ep->mpa_pkt +
@@ -1176,6 +1178,8 @@ static void connect_reply_upcall(struct c4iw_ep *ep, int status)
 				sizeof(struct mpa_v2_conn_params);
 		} else {
 			/* this means MPA_v1 is used */
+			event.ord = cur_max_read_depth(ep->com.dev);
+			event.ird = cur_max_read_depth(ep->com.dev);
 			event.private_data_len = ep->plen;
 			event.private_data = ep->mpa_pkt +
 				sizeof(struct mpa_message);
-- 
2.3.4
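Editorial illustration, not part of the posted patch: a user-space sketch
of the two branches in the hunk above. `struct ep_depths`,
`struct reply_event` and `fill_reply_depths()` are hypothetical, simplified
stand-ins; `max_read_depth` stands in for the value the driver obtains from
cur_max_read_depth().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified views of the endpoint and the CM event. */
struct ep_depths {
	unsigned int ird;	/* endpoint's ird field */
	unsigned int ord;	/* endpoint's ord field */
};

struct reply_event {
	unsigned int ird;
	unsigned int ord;
};

/*
 * Fill the ord/ird reported to the ULP in a connect reply.
 * With MPAv2 the endpoint's values are reported crosswise (event.ord
 * from ep->ird, event.ird from ep->ord), as in the patch. With MPAv1
 * nothing was negotiated, so both fall back to max_read_depth.
 */
void fill_reply_depths(struct reply_event *ev, const struct ep_depths *ep,
		       bool used_mpa_v2, unsigned int max_read_depth)
{
	if (used_mpa_v2) {
		ev->ord = ep->ird;
		ev->ird = ep->ord;
	} else {
		ev->ord = max_read_depth;
		ev->ird = max_read_depth;
	}
}
```

A client ULP could then cap its outstanding RDMA reads at the reported ord,
which is the SQ-stall scenario the commit message mentions.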
Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
Hi Doug, Tejun,

This is from the cgroups for-4.3 branch. The linux-rdma trunk will hit a
compilation error as it's behind Tejun's for-4.3 branch. The patch set
depends on some of the cgroup subsystem functionality for fork(), so those
changes must be merged to the linux-rdma trunk first.

Parav

On Tue, Sep 8, 2015 at 2:08 AM, Parav Pandit wrote:
> Currently user space applications can easily take away all the rdma
> device specific resources such as AH, CQ, QP, MR etc. As a result, other
> applications in other cgroups or kernel space ULPs may not even get a
> chance to allocate any rdma resources.
>
> This patch-set allows limiting rdma resources to a set of processes.
> It extends the device cgroup controller to limit rdma device resources.
>
> With this patch, the user verbs module queries the rdma device cgroup
> controller for the process's limit before consuming such a resource. It
> uncharges the resource counter after the resource is freed.
>
> It extends the task structure to hold statistics about the process's
> rdma resource usage so that when a process migrates from one controller
> to another, the right amount of resources can be migrated from one
> cgroup to the other.
>
> Future patches will support the RDMA flows resource and will be enhanced
> further to enforce limits on other resources and capabilities.
>
> Parav Pandit (7):
>   devcg: Added user option to rdma resource tracking.
>   devcg: Added rdma resource tracking module.
>   devcg: Added infrastructure for rdma device cgroup.
>   devcg: Added rdma resource tracker object per task
>   devcg: device cgroup's extension for RDMA resource.
>   devcg: Added support to use RDMA device cgroup.
>   devcg: Added Documentation of RDMA device cgroup.
>
>  Documentation/cgroups/devices.txt     |  32 ++-
>  drivers/infiniband/core/uverbs_cmd.c  | 139 +--
>  drivers/infiniband/core/uverbs_main.c |  39 +++-
>  include/linux/device_cgroup.h         |  53 +
>  include/linux/device_rdma_cgroup.h    |  83 +++
>  include/linux/sched.h                 |  12 +-
>  init/Kconfig                          |  12 +
>  security/Makefile                     |   1 +
>  security/device_cgroup.c              | 119 +++---
>  security/device_rdma_cgroup.c         | 422 ++
>  10 files changed, 850 insertions(+), 62 deletions(-)
>  create mode 100644 include/linux/device_rdma_cgroup.h
>  create mode 100644 security/device_rdma_cgroup.c
>
> --
> 1.8.3.1
>
[PATCH 3/7] devcg: Added infrastructure for rdma device cgroup.
1. Moved necessary functions and data structures to header file to
   reuse them at device cgroup white list functionality and for rdma
   functionality.
2. Added infrastructure to invoke RDMA specific routines for resource
   configuration, query and during fork handling.
3. Added sysfs interface files for configuring max limit of each rdma
   resource and one file for querying the controller's current resource
   usage.

Signed-off-by: Parav Pandit
---
 include/linux/device_cgroup.h |  53 +++
 security/device_cgroup.c      | 119 +-
 2 files changed, 136 insertions(+), 36 deletions(-)

diff --git a/include/linux/device_cgroup.h b/include/linux/device_cgroup.h
index 8b64221..cdbdd60 100644
--- a/include/linux/device_cgroup.h
+++ b/include/linux/device_cgroup.h
@@ -1,6 +1,57 @@
+#ifndef _DEVICE_CGROUP
+#define _DEVICE_CGROUP
+
 #include
+#include
+#include
 
 #ifdef CONFIG_CGROUP_DEVICE
+
+enum devcg_behavior {
+	DEVCG_DEFAULT_NONE,
+	DEVCG_DEFAULT_ALLOW,
+	DEVCG_DEFAULT_DENY,
+};
+
+/*
+ * exception list locking rules:
+ * hold devcgroup_mutex for update/read.
+ * hold rcu_read_lock() for read.
+ */
+
+struct dev_exception_item {
+	u32 major, minor;
+	short type;
+	short access;
+	struct list_head list;
+	struct rcu_head rcu;
+};
+
+struct dev_cgroup {
+	struct cgroup_subsys_state css;
+	struct list_head exceptions;
+	enum devcg_behavior behavior;
+
+#ifdef CONFIG_CGROUP_RDMA_RESOURCE
+	struct devcgroup_rdma rdma;
+#endif
+};
+
+static inline struct dev_cgroup *css_to_devcgroup(struct cgroup_subsys_state *s)
+{
+	return s ? container_of(s, struct dev_cgroup, css) : NULL;
+}
+
+static inline struct dev_cgroup *parent_devcgroup(struct dev_cgroup *dev_cg)
+{
+	return css_to_devcgroup(dev_cg->css.parent);
+}
+
+static inline struct dev_cgroup *task_devcgroup(struct task_struct *task)
+{
+	return css_to_devcgroup(task_css(task, devices_cgrp_id));
+}
+
 extern int __devcgroup_inode_permission(struct inode *inode, int mask);
 extern int devcgroup_inode_mknod(int mode, dev_t dev);
 static inline int devcgroup_inode_permission(struct inode *inode, int mask)
@@ -17,3 +68,5 @@ static inline int devcgroup_inode_permission(struct inode *inode, int mask)
 static inline int devcgroup_inode_mknod(int mode, dev_t dev)
 { return 0; }
 #endif
+
+#endif
diff --git a/security/device_cgroup.c b/security/device_cgroup.c
index 188c1d2..a0b3239 100644
--- a/security/device_cgroup.c
+++ b/security/device_cgroup.c
@@ -25,42 +25,6 @@
 static DEFINE_MUTEX(devcgroup_mutex);
 
-enum devcg_behavior {
-	DEVCG_DEFAULT_NONE,
-	DEVCG_DEFAULT_ALLOW,
-	DEVCG_DEFAULT_DENY,
-};
-
-/*
- * exception list locking rules:
- * hold devcgroup_mutex for update/read.
- * hold rcu_read_lock() for read.
- */
-
-struct dev_exception_item {
-	u32 major, minor;
-	short type;
-	short access;
-	struct list_head list;
-	struct rcu_head rcu;
-};
-
-struct dev_cgroup {
-	struct cgroup_subsys_state css;
-	struct list_head exceptions;
-	enum devcg_behavior behavior;
-};
-
-static inline struct dev_cgroup *css_to_devcgroup(struct cgroup_subsys_state *s)
-{
-	return s ? container_of(s, struct dev_cgroup, css) : NULL;
-}
-
-static inline struct dev_cgroup *task_devcgroup(struct task_struct *task)
-{
-	return css_to_devcgroup(task_css(task, devices_cgrp_id));
-}
-
 /*
  * called under devcgroup_mutex
  */
@@ -223,6 +187,9 @@ devcgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 	INIT_LIST_HEAD(&dev_cgroup->exceptions);
 	dev_cgroup->behavior = DEVCG_DEFAULT_NONE;
 
+#ifdef CONFIG_CGROUP_RDMA_RESOURCE
+	init_devcgroup_rdma_tracker(dev_cgroup);
+#endif
 	return &dev_cgroup->css;
 }
 
@@ -234,6 +201,25 @@ static void devcgroup_css_free(struct cgroup_subsys_state *css)
 	kfree(dev_cgroup);
 }
 
+#ifdef CONFIG_CGROUP_RDMA_RESOURCE
+static int devcgroup_can_attach(struct cgroup_subsys_state *dst_css,
+				struct cgroup_taskset *tset)
+{
+	return devcgroup_rdma_can_attach(dst_css, tset);
+}
+
+static void devcgroup_cancel_attach(struct cgroup_subsys_state *dst_css,
+				    struct cgroup_taskset *tset)
+{
+	devcgroup_rdma_cancel_attach(dst_css, tset);
+}
+
+static void devcgroup_fork(struct task_struct *task, void *priv)
+{
+	devcgroup_rdma_fork(task, priv);
+}
+#endif
+
 #define DEVCG_ALLOW 1
 #define DEVCG_DENY 2
 #define DEVCG_LIST 3
@@ -788,6 +774,62 @@ static struct cftype dev_cgroup_files[] = {
 		.seq_show = devcgroup_seq_show,
 		.private = DEVCG_LIST,
 	},
+
+#ifdef CONFIG_CGROUP_RDMA_RESOURCE
+	{
+		.name = "rdma.resource.uctx.max",
+		.write = devcgroup_rdma_set_max_resource,
+		.seq_show = devcgroup_rdma_get_max_resource,
+		.private = DEVCG_RDMA_RES_TYPE_UCTX,
+
[PATCH 7/7] devcg: Added Documentation of RDMA device cgroup.
Modified device cgroup documentation to reflect its dual purpose
without creating new cgroup subsystem for rdma.

Added documentation to describe functionality and usage of device cgroup
extension for RDMA.

Signed-off-by: Parav Pandit
---
 Documentation/cgroups/devices.txt | 32 +---
 1 file changed, 29 insertions(+), 3 deletions(-)

diff --git a/Documentation/cgroups/devices.txt b/Documentation/cgroups/devices.txt
index 3c1095c..eca5b70 100644
--- a/Documentation/cgroups/devices.txt
+++ b/Documentation/cgroups/devices.txt
@@ -1,9 +1,12 @@
-Device Whitelist Controller
+Device Controller
 
 1. Description:
 
-Implement a cgroup to track and enforce open and mknod restrictions
-on device files. A device cgroup associates a device access
+Device controller implements a cgroup for two purposes.
+
+1.1 Device white list controller
+It implement a cgroup to track and enforce open and mknod
+restrictions on device files. A device cgroup associates a device access
 whitelist with each cgroup. A whitelist entry has 4 fields.
 'type' is a (all), c (char), or b (block). 'all' means it applies
 to all types and all major and minor numbers. Major and minor are
@@ -15,8 +18,15 @@ cgroup gets a copy of the parent. Administrators can then remove
 devices from the whitelist or add new entries. A child cgroup can
 never receive a device access which is denied by its parent.
 
+1.2 RDMA device resource controller
+It implements a cgroup to limit various RDMA device resources for
+a controller. Such resource includes RDMA PD, CQ, AH, MR, SRQ, QP, FLOW.
+It limits RDMA resources access to tasks of the cgroup across multiple
+RDMA devices.
+
 2. User Interface
 
+2.1 Device white list controller
 An entry is added using devices.allow, and removed using
 devices.deny. For instance
 
@@ -33,6 +43,22 @@ will remove the default 'a *:* rwm' entry. Doing
 
 will add the 'a *:* rwm' entry to the whitelist.
 
+2.2 RDMA device controller
+
+RDMA resources are limited using devices.rdma.resource.max..
+Doing
+	echo 200 > /sys/fs/cgroup/1/rdma.resource.max_qp
+will limit maximum number of QP across all the process of cgroup to 200.
+
+More examples:
+	echo 200 > /sys/fs/cgroup/1/rdma.resource.max_flow
+	echo 10 > /sys/fs/cgroup/1/rdma.resource.max_pd
+	echo 15 > /sys/fs/cgroup/1/rdma.resource.max_srq
+	echo 1 > /sys/fs/cgroup/1/rdma.resource.max_uctx
+
+RDMA resource current usage can be tracked using devices.rdma.resource.usage
+	cat /sys/fs/cgroup/1/devices.rdma.resource.usage
+
 3. Security
 
 Any task can move itself between cgroups. This clearly won't
-- 
1.8.3.1
[PATCH 6/7] devcg: Added support to use RDMA device cgroup.
The RDMA uverbs module now queries the associated device cgroup rdma
controller before allocating device resources and uncharges them while
freeing rdma device resources.

Since the fput() sequence can free the resources from the workqueue
context (instead of the task context which allocated the resource), it
passes the associated ucontext pointer during uncharge, so that the rdma
cgroup controller can correctly free the resource of the right task and
the right cgroup.

Signed-off-by: Parav Pandit
---
 drivers/infiniband/core/uverbs_cmd.c  | 139 +-
 drivers/infiniband/core/uverbs_main.c |  39 +-
 2 files changed, 156 insertions(+), 22 deletions(-)

diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index bbb02ff..c080374 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -37,6 +37,7 @@
 #include
 #include
 #include
+#include
 
 #include
 
@@ -281,6 +282,19 @@ static void put_xrcd_read(struct ib_uobject *uobj)
 	put_uobj_read(uobj);
 }
 
+static void init_ucontext_lists(struct ib_ucontext *ucontext)
+{
+	INIT_LIST_HEAD(&ucontext->pd_list);
+	INIT_LIST_HEAD(&ucontext->mr_list);
+	INIT_LIST_HEAD(&ucontext->mw_list);
+	INIT_LIST_HEAD(&ucontext->cq_list);
+	INIT_LIST_HEAD(&ucontext->qp_list);
+	INIT_LIST_HEAD(&ucontext->srq_list);
+	INIT_LIST_HEAD(&ucontext->ah_list);
+	INIT_LIST_HEAD(&ucontext->xrcd_list);
+	INIT_LIST_HEAD(&ucontext->rule_list);
+}
+
 ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file,
 			      const char __user *buf,
 			      int in_len, int out_len)
@@ -313,22 +327,18 @@ ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file,
 		   (unsigned long) cmd.response + sizeof resp,
 		   in_len - sizeof cmd, out_len - sizeof resp);
 
+	ret = devcgroup_rdma_try_charge_resource(DEVCG_RDMA_RES_TYPE_UCTX, 1);
+	if (ret)
+		goto err;
+
 	ucontext = ibdev->alloc_ucontext(ibdev, &udata);
 	if (IS_ERR(ucontext)) {
 		ret = PTR_ERR(ucontext);
-		goto err;
+		goto err_alloc;
 	}
 
 	ucontext->device = ibdev;
-	INIT_LIST_HEAD(&ucontext->pd_list);
-	INIT_LIST_HEAD(&ucontext->mr_list);
-	INIT_LIST_HEAD(&ucontext->mw_list);
-	INIT_LIST_HEAD(&ucontext->cq_list);
-	INIT_LIST_HEAD(&ucontext->qp_list);
-	INIT_LIST_HEAD(&ucontext->srq_list);
-	INIT_LIST_HEAD(&ucontext->ah_list);
-	INIT_LIST_HEAD(&ucontext->xrcd_list);
-	INIT_LIST_HEAD(&ucontext->rule_list);
+	init_ucontext_lists(ucontext);
 	rcu_read_lock();
 	ucontext->tgid = get_task_pid(current->group_leader, PIDTYPE_PID);
 	rcu_read_unlock();
@@ -395,6 +405,8 @@ err_free:
 	put_pid(ucontext->tgid);
 	ibdev->dealloc_ucontext(ucontext);
 
+err_alloc:
+	devcgroup_rdma_uncharge_resource(NULL, DEVCG_RDMA_RES_TYPE_UCTX, 1);
 err:
 	mutex_unlock(&file->mutex);
 	return ret;
@@ -412,15 +424,23 @@ static void copy_query_dev_fields(struct ib_uverbs_file *file,
 	resp->vendor_id		= attr->vendor_id;
 	resp->vendor_part_id	= attr->vendor_part_id;
 	resp->hw_ver		= attr->hw_ver;
-	resp->max_qp		= attr->max_qp;
+	resp->max_qp		= min_t(int, attr->max_qp,
+				devcgroup_rdma_query_resource_limit(
+					DEVCG_RDMA_RES_TYPE_QP));
 	resp->max_qp_wr		= attr->max_qp_wr;
 	resp->device_cap_flags	= attr->device_cap_flags;
 	resp->max_sge		= attr->max_sge;
 	resp->max_sge_rd	= attr->max_sge_rd;
-	resp->max_cq		= attr->max_cq;
+	resp->max_cq		= min_t(int, attr->max_cq,
+				devcgroup_rdma_query_resource_limit(
+					DEVCG_RDMA_RES_TYPE_CQ));
 	resp->max_cqe		= attr->max_cqe;
-	resp->max_mr		= attr->max_mr;
-	resp->max_pd		= attr->max_pd;
+	resp->max_mr		= min_t(int, attr->max_mr,
+				devcgroup_rdma_query_resource_limit(
+					DEVCG_RDMA_RES_TYPE_MR));
+	resp->max_pd		= min_t(int, attr->max_pd,
+				devcgroup_rdma_query_resource_limit(
+					DEVCG_RDMA_RES_TYPE_PD));
 	resp->max_qp_rd_atom	= attr->max_qp_rd_atom;
 	resp->max_ee_rd_atom	= attr->max_ee_rd_atom;
 	resp->max_res_rd_atom	= attr->max_res_rd_atom;
@@ -429,16 +449,22 @@ static void copy_query_dev_fields(struct ib_uverbs_file *file,
 	resp->atomic_cap	= attr->atomic_cap;
 	resp->max_ee		= attr->max_ee;
[PATCH 4/7] devcg: Added rdma resource tracker object per task
Added RDMA device resource tracking object per task.
Added comments to capture usage of task lock by device cgroup
for rdma.

Signed-off-by: Parav Pandit
---
 include/linux/sched.h | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ae21f15..a5f79b6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1334,6 +1334,8 @@ union rcu_special {
 };
 struct rcu_node;
 
+struct task_rdma_res_counter;
+
 enum perf_event_task_context {
 	perf_invalid_context = -1,
 	perf_hw_context = 0,
@@ -1637,6 +1639,14 @@ struct task_struct {
 	struct css_set __rcu *cgroups;
 	/* cg_list protected by css_set_lock and tsk->alloc_lock */
 	struct list_head cg_list;
+
+#ifdef CONFIG_CGROUP_RDMA_RESOURCE
+	/* RDMA resource accounting counters, allocated only
+	 * when RDMA resources are created by a task.
+	 */
+	struct task_rdma_res_counter *rdma_res_counter;
+#endif
+
 #endif
 #ifdef CONFIG_FUTEX
 	struct robust_list_head __user *robust_list;
@@ -2676,7 +2686,7 @@ static inline int thread_group_empty(struct task_struct *p)
  * Protects ->fs, ->files, ->mm, ->group_info, ->comm, keyring
  * subscriptions and synchronises with wait4(). Also used in procfs. Also
  * pins the final release of task.io_context. Also protects ->cpuset and
- * ->cgroup.subsys[]. And ->vfork_done.
+ * ->cgroup.subsys[]. Also projtects ->vfork_done and ->rdma_res_counter.
  *
  * Nests both inside and outside of read_lock(&tasklist_lock).
  * It must not be nested with write_lock_irq(&tasklist_lock),
-- 
1.8.3.1
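Editorial illustration, not part of the posted patch: one plausible reading
of why a per-task counter helps when a task changes cgroup. Knowing what
the task itself holds lets exactly that amount be moved from the source
cgroup to the destination. All names below are hypothetical, the resource
list is shortened to two types, locking is omitted, and the real logic in
security/device_rdma_cgroup.c (not shown in full in this thread) may
differ.

```c
#include <assert.h>

enum { RES_QP, RES_CQ, RES_MAX };	/* hypothetical, shortened list */

struct cg_counters {
	int limit[RES_MAX];
	int usage[RES_MAX];
};

struct task_counters {
	int usage[RES_MAX];	/* what this task currently holds */
};

/*
 * Move a task's charges from src to dst. Returns 0 on success, or -1
 * (an arbitrary error value for this sketch) if dst lacks headroom for
 * any resource type, in which case nothing is changed.
 */
int migrate_task(const struct task_counters *t,
		 struct cg_counters *src, struct cg_counters *dst)
{
	int i;

	for (i = 0; i < RES_MAX; i++)
		if (dst->usage[i] + t->usage[i] > dst->limit[i])
			return -1;

	for (i = 0; i < RES_MAX; i++) {
		src->usage[i] -= t->usage[i];
		dst->usage[i] += t->usage[i];
	}
	return 0;
}
```

Checking every resource type before changing anything keeps a refused
migration free of side effects, which is roughly the split that separate
can_attach and cancel_attach callbacks are meant to support.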
[PATCH 5/7] devcg: device cgroup's extension for RDMA resource.
Extension of device cgroup for RDMA device resources.

This implements an RDMA resource tracker to limit RDMA resources such as
AH, CQ, PD, QP, MR, SRQ etc. for processes of the cgroup.

It implements an RDMA resource limit module to limit consumption of RDMA
resources by processes of the cgroup. RDMA resources are tracked on a
per-task basis. RDMA resources across multiple such devices are limited
among multiple processes of the owning device cgroup.

The RDMA device cgroup extension returns an error when user space
applications try to allocate more resources than the configured limit.

Signed-off-by: Parav Pandit
---
 include/linux/device_rdma_cgroup.h |  83
 security/device_rdma_cgroup.c      | 422 +
 2 files changed, 505 insertions(+)
 create mode 100644 include/linux/device_rdma_cgroup.h
 create mode 100644 security/device_rdma_cgroup.c

diff --git a/include/linux/device_rdma_cgroup.h b/include/linux/device_rdma_cgroup.h
new file mode 100644
index 0000000..a2c261b
--- /dev/null
+++ b/include/linux/device_rdma_cgroup.h
@@ -0,0 +1,83 @@
+#ifndef _DEVICE_RDMA_CGROUP_H
+#define _DEVICE_RDMA_CGROUP_H
+
+#include
+
+/* RDMA resources from device cgroup perspective */
+enum devcgroup_rdma_rt {
+	DEVCG_RDMA_RES_TYPE_UCTX,
+	DEVCG_RDMA_RES_TYPE_CQ,
+	DEVCG_RDMA_RES_TYPE_PD,
+	DEVCG_RDMA_RES_TYPE_AH,
+	DEVCG_RDMA_RES_TYPE_MR,
+	DEVCG_RDMA_RES_TYPE_MW,
+	DEVCG_RDMA_RES_TYPE_SRQ,
+	DEVCG_RDMA_RES_TYPE_QP,
+	DEVCG_RDMA_RES_TYPE_FLOW,
+	DEVCG_RDMA_RES_TYPE_MAX,
+};
+
+struct ib_ucontext;
+
+#define DEVCG_RDMA_MAX_RESOURCES S32_MAX
+
+#ifdef CONFIG_CGROUP_RDMA_RESOURCE
+
+#define DEVCG_RDMA_MAX_RESOURCE_STR "max"
+
+enum devcgroup_rdma_access_files {
+	DEVCG_RDMA_LIST_USAGE,
+};
+
+struct task_rdma_res_counter {
+	/* allows atomic increment of task and cgroup counters
+	 * to avoid race with migration task.
+	 */
+	spinlock_t lock;
+	u32 usage[DEVCG_RDMA_RES_TYPE_MAX];
+};
+
+struct devcgroup_rdma_tracker {
+	int limit;
+	atomic_t usage;
+	int failcnt;
+};
+
+struct devcgroup_rdma {
+	struct devcgroup_rdma_tracker tracker[DEVCG_RDMA_RES_TYPE_MAX];
+};
+
+struct dev_cgroup;
+
+void init_devcgroup_rdma_tracker(struct dev_cgroup *dev_cg);
+ssize_t devcgroup_rdma_set_max_resource(struct kernfs_open_file *of,
+					char *buf,
+					size_t nbytes, loff_t off);
+int devcgroup_rdma_get_max_resource(struct seq_file *m, void *v);
+int devcgroup_rdma_show_usage(struct seq_file *m, void *v);
+
+int devcgroup_rdma_try_charge_resource(enum devcgroup_rdma_rt type, int num);
+void devcgroup_rdma_uncharge_resource(struct ib_ucontext *ucontext,
+				      enum devcgroup_rdma_rt type, int num);
+void devcgroup_rdma_fork(struct task_struct *task, void *priv);
+
+int devcgroup_rdma_can_attach(struct cgroup_subsys_state *css,
+			      struct cgroup_taskset *tset);
+void devcgroup_rdma_cancel_attach(struct cgroup_subsys_state *css,
+				  struct cgroup_taskset *tset);
+int devcgroup_rdma_query_resource_limit(enum devcgroup_rdma_rt type);
+#else
+
+static inline int devcgroup_rdma_try_charge_resource(
+			enum devcgroup_rdma_rt type, int num)
+{ return 0; }
+static inline void devcgroup_rdma_uncharge_resource(
+			struct ib_ucontext *ucontext,
+			enum devcgroup_rdma_rt type, int num)
+{ }
+static inline int devcgroup_rdma_query_resource_limit(
+			enum devcgroup_rdma_rt type)
+{ return DEVCG_RDMA_MAX_RESOURCES; }
+#endif
+
+#endif
diff --git a/security/device_rdma_cgroup.c b/security/device_rdma_cgroup.c
new file mode 100644
index 0000000..fb4cc59
--- /dev/null
+++ b/security/device_rdma_cgroup.c
@@ -0,0 +1,422 @@
+/*
+ * RDMA device cgroup controller of device controller cgroup.
+ *
+ * Provides a cgroup hierarchy to limit various RDMA resource allocation to a
+ * configured limit of the cgroup.
+ *
+ * Its easy for user space applications to consume of RDMA device specific
+ * hardware resources. Such resource exhaustion should be prevented so that
+ * user space applications and other kernel consumers gets chance to allocate
+ * and effectively use the hardware resources.
+ *
+ * In order to use the device rdma controller, set the maximum resource count
+ * per cgroup, which ensures that total rdma resources for processes belonging
+ * to a cgroup doesn't exceed configured limit.
+ *
+ * RDMA resource limits are hierarchical, so the highest configured limit of
+ * the hierarchy is enforced. Allowing resource limit configuration to default
+ * cgroup allows fair share to kernel space ULPs as well.
+ *
+ * This file is subject to the terms and conditions of version 2 of the GNU
+ * General Public License. See the file COPYING in the
[PATCH 2/7] devcg: Added rdma resource tracking module.
Added RDMA resource tracking object of device cgroup.

Signed-off-by: Parav Pandit
---
 security/Makefile | 1 +
 1 file changed, 1 insertion(+)

diff --git a/security/Makefile b/security/Makefile
index c9bfbc8..c9ad56d 100644
--- a/security/Makefile
+++ b/security/Makefile
@@ -23,6 +23,7 @@ obj-$(CONFIG_SECURITY_TOMOYO) += tomoyo/
 obj-$(CONFIG_SECURITY_APPARMOR)		+= apparmor/
 obj-$(CONFIG_SECURITY_YAMA)		+= yama/
 obj-$(CONFIG_CGROUP_DEVICE)		+= device_cgroup.o
+obj-$(CONFIG_CGROUP_RDMA_RESOURCE)	+= device_rdma_cgroup.o
 
 # Object integrity file lists
 subdir-$(CONFIG_INTEGRITY)		+= integrity
-- 
1.8.3.1
[PATCH 1/7] devcg: Added user option to rdma resource tracking.
Added user configuration option to enable/disable RDMA resource tracking
feature of device cgroup as sub module.

Signed-off-by: Parav Pandit
---
 init/Kconfig | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/init/Kconfig b/init/Kconfig
index 2184b34..089db85 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -977,6 +977,18 @@ config CGROUP_DEVICE
 	  Provides a cgroup implementing whitelists for devices which
 	  a process in the cgroup can mknod or open.
 
+config CGROUP_RDMA_RESOURCE
+	bool "RDMA Resource Controller for cgroups"
+	depends on CGROUP_DEVICE
+	default n
+	help
+	  This option enables limiting rdma resources for a device cgroup.
+	  Using this option, user space processes can be limited to use
+	  limited number of RDMA resources such as MR, PD, QP, AH, FLOW, CQ
+	  etc.
+
+	  Say N if unsure.
+
 config CPUSETS
 	bool "Cpuset support"
 	help
-- 
1.8.3.1
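Editorial illustration, not part of the posted patch: a user-space sketch
of how callers can stay free of #ifdefs when an option like this is
disabled, using the same static inline stub idiom that patch 5/7 of this
series applies to devcgroup_rdma_try_charge_resource() and
devcgroup_rdma_query_resource_limit(). The `CONFIG_DEMO_RDMA_RESOURCE`
macro and the `demo_*` names are hypothetical, and `INT_MAX` stands in for
the kernel's S32_MAX.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical option; deliberately left undefined in this sketch. */
/* #define CONFIG_DEMO_RDMA_RESOURCE 1 */

#define DEMO_RDMA_MAX_RESOURCES INT_MAX

#ifdef CONFIG_DEMO_RDMA_RESOURCE
/* Feature enabled: real implementations would live in a separate file. */
int demo_try_charge(int type, int num);
int demo_query_limit(int type);
#else
/* Feature compiled out: charging always succeeds, limit is "unlimited". */
static inline int demo_try_charge(int type, int num)
{
	(void)type;
	(void)num;
	return 0;
}

static inline int demo_query_limit(int type)
{
	(void)type;
	return DEMO_RDMA_MAX_RESOURCES;
}
#endif
```

With the option off, a caller that clamps a device limit against
demo_query_limit() simply gets the device limit back, so behaviour is
unchanged for kernels built without the feature.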
[PATCH 0/7] devcg: device cgroup extension for rdma resource
Currently user space applications can easily take away all the rdma
device specific resources such as AH, CQ, QP, MR etc. As a result, other
applications in other cgroups or kernel space ULPs may not even get a
chance to allocate any rdma resources.

This patch-set allows limiting rdma resources to a set of processes.
It extends the device cgroup controller to limit rdma device resources.

With this patch, the user verbs module queries the rdma device cgroup
controller for the process's limit before consuming such a resource. It
uncharges the resource counter after the resource is freed.

It extends the task structure to hold statistics about the process's
rdma resource usage so that when a process migrates from one controller
to another, the right amount of resources can be migrated from one
cgroup to the other.

Future patches will support the RDMA flows resource and will be enhanced
further to enforce limits on other resources and capabilities.

Parav Pandit (7):
  devcg: Added user option to rdma resource tracking.
  devcg: Added rdma resource tracking module.
  devcg: Added infrastructure for rdma device cgroup.
  devcg: Added rdma resource tracker object per task
  devcg: device cgroup's extension for RDMA resource.
  devcg: Added support to use RDMA device cgroup.
  devcg: Added Documentation of RDMA device cgroup.

 Documentation/cgroups/devices.txt     |  32 ++-
 drivers/infiniband/core/uverbs_cmd.c  | 139 +--
 drivers/infiniband/core/uverbs_main.c |  39 +++-
 include/linux/device_cgroup.h         |  53 +
 include/linux/device_rdma_cgroup.h    |  83 +++
 include/linux/sched.h                 |  12 +-
 init/Kconfig                          |  12 +
 security/Makefile                     |   1 +
 security/device_cgroup.c              | 119 +++---
 security/device_rdma_cgroup.c         | 422 ++
 10 files changed, 850 insertions(+), 62 deletions(-)
 create mode 100644 include/linux/device_rdma_cgroup.h
 create mode 100644 security/device_rdma_cgroup.c

-- 
1.8.3.1
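Editorial illustration, not part of the posted series: a user-space sketch
of the charge/uncharge idea the cover letter describes. The fields mirror
`struct devcgroup_rdma_tracker` from patch 5/7 (limit, usage, failcnt), but
the function bodies are an assumption, since the implementation file is not
shown in full in this thread. `tracker_try_charge()`, `tracker_uncharge()`
and the -1 error value are hypothetical, and C11 atomics stand in for the
kernel's atomic_t.

```c
#include <assert.h>
#include <stdatomic.h>

/* Mirrors the fields of struct devcgroup_rdma_tracker; names are ours. */
struct rdma_tracker {
	int limit;		/* configured maximum for this resource */
	atomic_int usage;	/* currently charged count */
	int failcnt;		/* number of refused charges */
};

/*
 * Try to charge num resources. On success the usage grows by num and 0
 * is returned; if the limit would be exceeded the charge is rolled back,
 * failcnt is bumped and -1 is returned.
 */
int tracker_try_charge(struct rdma_tracker *t, int num)
{
	int old = atomic_fetch_add(&t->usage, num);

	if (old + num > t->limit) {
		atomic_fetch_sub(&t->usage, num);
		t->failcnt++;
		return -1;
	}
	return 0;
}

/* Give back num previously charged resources. */
void tracker_uncharge(struct rdma_tracker *t, int num)
{
	atomic_fetch_sub(&t->usage, num);
}
```

In this sketch a caller would charge before asking the device for the
resource and uncharge both on a failed allocation and on the eventual free,
which is the pairing patch 6/7 shows around alloc_ucontext().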