Re: BUG: NULL pointer dereference at ib_uverbs_comp_handler+0x20
On Fri, Jul 28, 2017 at 8:38 PM, Logan Gunthorpe <log...@deltatee.com> wrote: > Hi, > > My system has been failing with recent kernels (4.12.x and 4.13-rc2) > with a NULL pointer dereference at the stack trace given at the end of > this email. This happens when simply running 'ib_write_bw -R ' > with a Chelsio T6 (cxgb4). I've bisected (log attached) to find the > offending commit to be: > > commit 1e7710f3f6563940bb6bbc94aa8eadfd344a86af > Author: Matan Barak <mat...@mellanox.com> > IB/core: Change completion channel to use the reworked objects schema > > Reverting this commit (and the dependent commits db1b5ddd53365 and > e0fcc61113c that also fix other bugs with this commit) from v4.12.3 > fixes the issue. > > I did the bisect with the userspace libraries in Debian Stretch but I > also had this bug with rdma-core v14. I was pretty sure v4.12 kernels > worked for me in the past but likely only before I upgraded from Jessie > to Stretch. > > Thanks, > > Logan > Hi Logan, I've tried to reproduce this in my setup (ConnectX 4, RoCE mode) using 1e7710f3f6563940bb6bbc94aa8eadfd344a86af as the kernel's head. I've used d779dd9a9e8f as rdma-core user-space and the latest perftest bits. I couldn't reproduce this problem. I'll try to review this commit again, but please provide more information. For example, do you see the iwpm_register_pid error when these commits are reverted? Does this also happen when using the plain rdma-cm examples (ucmatose, rping)? Does it happen in a plain verbs application (ibv_rc_pingpong)? I assume you use iWarp, right? Did you test other modes? Did you reproduce this issue with your ConnectX 4 as well? Could you please reproduce it with KASAN as well? PS, e0fcc61113c isn't a bug fix, it's just a simple refactor. Regards, Matan > > PS. As a side rant, this bug was found after a very *frustrating* day of > what was supposed to be the 20 minute task of getting my RDMA cards > plugged in again. I tried with both CX4s and the T6s (and I'm still not > sure if my CX4s work yet). Instead, it turns out there's a whole mess of > bugs in the kernel I had to go up against. I went back and forth between > different versions of the userspace libraries because I was sure 4.11 > worked -- but it turned out 4.11.10+, 4.12.x and who knows what other > stable kernels are currently broken by the bug fixed in [1]. And there > was a whole other bug that broke things that was fixed in the 4.12-rc > series that I had to carefully bisect around to find the one reported > above. So frustrating!! > > [1] 5a7a88f1b488e4ee49eb3d5b82612d4d9ffdf2c3 > > -- > > [ 53.320439] iwpm_register_pid: Unable to send a nlmsg (client = 2) > [ 54.738579] BUG: unable to handle kernel NULL pointer dereference at > 0058 > [ 54.747439] IP: _raw_spin_lock_irqsave+0x10/0x30 > [ 54.752719] PGD 0 > [ 54.752721] P4D 0 > [ 54.755049] > [ 54.759109] Oops: 0002 [#1] SMP > [ 54.762699] Modules linked in: > [ 54.766195] CPU: 0 PID: 5 Comm: kworker/u16:0 Not tainted > 4.13.0-rc2.direct #708 > [ 54.774536] Hardware name: Supermicro SYS-7047GR-TRF/X9DRG-QF, BIOS > 3.0a 12/05/2013 > [ 54.783182] Workqueue: iw_cxgb4 process_work > [ 54.788036] task: 880276a5ee80 task.stack: c90c4000 > [ 54.794728] RIP: 0010:_raw_spin_lock_irqsave+0x10/0x30 > [ 54.800552] RSP: 0018:c90c7c70 EFLAGS: 00010046 > [ 54.806473] RAX: RBX: 0002 RCX: > > [ 54.814524] RDX: 0001 RSI: 0058 RDI: > 0058 > [ 54.822583] RBP: 880470484600 R08: 0001 R09: > 0001 > [ 54.830663] R10: 0040 R11: 88047420b400 R12: > 0282 > [ 54.838744] R13: c90c7dc0 R14: 0001 R15: > 880470484600 > [ 54.846825] FS: () GS:880277c0() > knlGS: > [ 54.855997] CS: 0010 DS: ES: CR0: 80050033 > [ 54.862522] CR2: 0058 CR3: 01e0a000 CR4: > 000406f0 > [ 54.870602] Call Trace: > [ 54.873442] ? ib_uverbs_comp_handler+0x20/0xe0 > [ 54.878610] ? flush_qp+0x6e/0x2b0 > [ 54.882514] ? c4iw_modify_qp+0x11c2/0x1870 > [ 54.887295] ? close_con_rpl+0xe7/0x170 > [ 54.891686] ? kfree_skb+0x33/0x90 > [ 54.895592] ? skb_dequeue+0x52/0x60 > [ 54.899690] ? process_work+0x4a/0x60 > [ 54.903887] ? process_one_work+0x1c2/0x3e0 > [ 54.908664] ? worker_thread+0x47/0x3d0 > [ 54.913056] ? kthread+0xfc/0x130 > [ 54.916864] ? create_worker+0x180/0x180 > [ 54.921353] ? kthread_create_on_node+0x40/0x40 > [ 54.926521] ? ret_from_fork+0x22/0x30 > [ 54.930811] Code: c0 74 05 e8 b3 1c 73 ff 48 89
Re: BUG: NULL pointer dereference at ib_uverbs_comp_handler+0x20
On Fri, Jul 28, 2017 at 8:38 PM, Logan Gunthorpe wrote: > Hi, > > My system has been failing with recent kernels (4.12.x and 4.13-rc2) > with a NULL pointer dereference at the stack trace given at the end of > this email. This happens when simply running 'ib_write_bw -R ' > with a Chelsio T6 (cxgb4). I've bisected (log attached) to find the > offending commit to be: > > commit 1e7710f3f6563940bb6bbc94aa8eadfd344a86af > Author: Matan Barak > IB/core: Change completion channel to use the reworked objects schema > > Reverting this commit (and the dependent commits db1b5ddd53365 and > e0fcc61113c that also fix other bugs with this commit) from v4.12.3 > fixes the issue. > > I did the bisect with the userspace libraries in Debian Stretch but I > also had this bug with rdma-core v14. I was pretty sure v4.12 kernels > worked for me in the past but likely only before I upgraded from Jessie > to Stretch. > > Thanks, > > Logan > Hi Logan, I've tried to reproduce this in my setup (ConnectX 4, RoCE mode) using 1e7710f3f6563940bb6bbc94aa8eadfd344a86af as the kernel's head. I've used d779dd9a9e8f as rdma-core user-space and the latest perftest bits. I couldn't reproduce this problem. I'll try to review this commit again, but please provide more information. For example, do you see the iwpm_register_pid error when these commits are reverted? Does this also happen when using the plain rdma-cm examples (ucmatose, rping)? Does it happen in a plain verbs application (ibv_rc_pingpong)? I assume you use iWarp, right? Did you test other modes? Did you reproduce this issue with your ConnectX 4 as well? Could you please reproduce it with KASAN as well? PS, e0fcc61113c isn't a bug fix, it's just a simple refactor. Regards, Matan > > PS. As a side rant, this bug was found after a very *frustrating* day of > what was supposed to be the 20 minute task of getting my RDMA cards > plugged in again. I tried with both CX4s and the T6s (and I'm still not > sure if my CX4s work yet). Instead, it turns out there's a whole mess of > bugs in the kernel I had to go up against. I went back and forth between > different versions of the userspace libraries because I was sure 4.11 > worked -- but it turned out 4.11.10+, 4.12.x and who knows what other > stable kernels are currently broken by the bug fixed in [1]. And there > was a whole other bug that broke things that was fixed in the 4.12-rc > series that I had to carefully bisect around to find the one reported > above. So frustrating!! > > [1] 5a7a88f1b488e4ee49eb3d5b82612d4d9ffdf2c3 > > -- > > [ 53.320439] iwpm_register_pid: Unable to send a nlmsg (client = 2) > [ 54.738579] BUG: unable to handle kernel NULL pointer dereference at > 0058 > [ 54.747439] IP: _raw_spin_lock_irqsave+0x10/0x30 > [ 54.752719] PGD 0 > [ 54.752721] P4D 0 > [ 54.755049] > [ 54.759109] Oops: 0002 [#1] SMP > [ 54.762699] Modules linked in: > [ 54.766195] CPU: 0 PID: 5 Comm: kworker/u16:0 Not tainted > 4.13.0-rc2.direct #708 > [ 54.774536] Hardware name: Supermicro SYS-7047GR-TRF/X9DRG-QF, BIOS > 3.0a 12/05/2013 > [ 54.783182] Workqueue: iw_cxgb4 process_work > [ 54.788036] task: 880276a5ee80 task.stack: c90c4000 > [ 54.794728] RIP: 0010:_raw_spin_lock_irqsave+0x10/0x30 > [ 54.800552] RSP: 0018:c90c7c70 EFLAGS: 00010046 > [ 54.806473] RAX: RBX: 0002 RCX: > > [ 54.814524] RDX: 0001 RSI: 0058 RDI: > 0058 > [ 54.822583] RBP: 880470484600 R08: 0001 R09: > 0001 > [ 54.830663] R10: 0040 R11: 88047420b400 R12: > 0282 > [ 54.838744] R13: c90c7dc0 R14: 0001 R15: > 880470484600 > [ 54.846825] FS: () GS:880277c0() > knlGS: > [ 54.855997] CS: 0010 DS: ES: CR0: 80050033 > [ 54.862522] CR2: 0058 CR3: 01e0a000 CR4: > 000406f0 > [ 54.870602] Call Trace: > [ 54.873442] ? ib_uverbs_comp_handler+0x20/0xe0 > [ 54.878610] ? flush_qp+0x6e/0x2b0 > [ 54.882514] ? c4iw_modify_qp+0x11c2/0x1870 > [ 54.887295] ? close_con_rpl+0xe7/0x170 > [ 54.891686] ? kfree_skb+0x33/0x90 > [ 54.895592] ? skb_dequeue+0x52/0x60 > [ 54.899690] ? process_work+0x4a/0x60 > [ 54.903887] ? process_one_work+0x1c2/0x3e0 > [ 54.908664] ? worker_thread+0x47/0x3d0 > [ 54.913056] ? kthread+0xfc/0x130 > [ 54.916864] ? create_worker+0x180/0x180 > [ 54.921353] ? kthread_create_on_node+0x40/0x40 > [ 54.926521] ? ret_from_fork+0x22/0x30 > [ 54.930811] Code: c0 74 05 e8 b3 1c 73 ff 48 89 d8 5b c3 0f 1f 40 00 > 66 2e 0f 1f 84 00
Re: [PATCH] IB/core: fix semicolon.cocci warnings
On Wed, Jun 7, 2017 at 11:26 PM, Julia Lawallwrote: > Remove unneeded semicolon. > > Generated by: scripts/coccinelle/misc/semicolon.cocci Thanks. I'll fix this in the next version of this series.
Re: [PATCH] IB/core: fix semicolon.cocci warnings
On Wed, Jun 7, 2017 at 11:26 PM, Julia Lawall wrote: > Remove unneeded semicolon. > > Generated by: scripts/coccinelle/misc/semicolon.cocci Thanks. I'll fix this in the next version of this series.
Re: [PATCH] net/mlx5: drop duplicate header delay.h
On 24/11/2016 15:58, Geliang Tang wrote: Drop duplicate header delay.h from mlx5/core/main.c. Signed-off-by: Geliang Tang <geliangt...@gmail.com> --- drivers/net/ethernet/mellanox/mlx5/core/main.c | 1 - 1 file changed, 1 deletion(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c index f28df33..d7a55eb 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/main.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c @@ -46,7 +46,6 @@ #include #include #include -#include #include #ifdef CONFIG_RFS_ACCEL #include Thanks. Acked-by: Matan Barak <mat...@mellanox.com>
Re: [PATCH] net/mlx5: drop duplicate header delay.h
On 24/11/2016 15:58, Geliang Tang wrote: Drop duplicate header delay.h from mlx5/core/main.c. Signed-off-by: Geliang Tang --- drivers/net/ethernet/mellanox/mlx5/core/main.c | 1 - 1 file changed, 1 deletion(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c index f28df33..d7a55eb 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/main.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c @@ -46,7 +46,6 @@ #include #include #include -#include #include #ifdef CONFIG_RFS_ACCEL #include Thanks. Acked-by: Matan Barak
Re: [PATCH] net/mlx5: Simplify a test
On 01/11/2016 09:10, Christophe JAILLET wrote: 'create_root_ns()' does not return an error pointer, so the test can be simplified to be more consistent. Signed-off-by: Christophe JAILLET <christophe.jail...@wanadoo.fr> --- drivers/net/ethernet/mellanox/mlx5/core/fs_core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c index 904853f9cf7a..330955f6badc 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c @@ -1833,7 +1833,7 @@ static int init_root_ns(struct mlx5_flow_steering *steering) { steering->root_ns = create_root_ns(steering, FS_FT_NIC_RX); - if (IS_ERR_OR_NULL(steering->root_ns)) + if (!steering->root_ns) goto cleanup; if (init_root_tree(steering, _fs, >root_ns->ns.node)) Thanks. Acked-by: Matan Barak <mat...@mellanox.com>
Re: [PATCH] net/mlx5: Simplify a test
On 01/11/2016 09:10, Christophe JAILLET wrote: 'create_root_ns()' does not return an error pointer, so the test can be simplified to be more consistent. Signed-off-by: Christophe JAILLET --- drivers/net/ethernet/mellanox/mlx5/core/fs_core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c index 904853f9cf7a..330955f6badc 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c @@ -1833,7 +1833,7 @@ static int init_root_ns(struct mlx5_flow_steering *steering) { steering->root_ns = create_root_ns(steering, FS_FT_NIC_RX); - if (IS_ERR_OR_NULL(steering->root_ns)) + if (!steering->root_ns) goto cleanup; if (init_root_tree(steering, _fs, >root_ns->ns.node)) Thanks. Acked-by: Matan Barak
Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
On 14/09/2016 10:06, Parav Pandit wrote: Hi Dennis, Do you know how would HFI1 driver would work along with rdma cgroup? Hi Matan, Leon, Jason, Apart from HFI1, is there any other concern? I just wonder how things like RSS will work. For example, a RSS QP doesn't really have a queue (if I recall, it's connected to work queues via an indirection table). So, when a user creates such a QP, do you want to account it as a regular QP? How are work queues accounted? Or Patch is good to go? 4.8 dates are close by (2 weeks) and there are two git trees involved (that might cause merge error to Linus) so if there are no issues, I would like to make request to Doug to consider it for 4.8 early on. Parav On Mon, Sep 12, 2016 at 10:37 AM, Leon Romanovskywrote: On Sun, Sep 11, 2016 at 11:52:35AM -0600, Jason Gunthorpe wrote: On Sun, Sep 11, 2016 at 07:24:45PM +0200, Christoph Hellwig wrote: I've posted some initial work toward a) a while ago, and once we Did it get merged? Do you have a pointer? http://www.spinics.net/lists/linux-rdma/msg31958.html Right, I remember that. Certainly the right direction However, everything under verbs is not straightforward. The files in userspace are not copies... user: struct ibv_query_device { __u32 command; __u16 in_words; __u16 out_words; __u64 response; __u64 driver_data[0]; }; kernel: struct ib_uverbs_query_device { __u64 response; __u64 driver_data[0]; }; We'll obviously need different strutures for the libibvers API and the kernel interface in this case, and we'll need to figure out how to properly translate them. I think a cast, plus compile time type checking ala BUILD_BUG_ON is the way to go. I'm not sure I follow, which would I cast? BUILD_BUG_ON(sizeof(ibv_query_device) == sizeof(ib_uverbs_cmd_hdr) + sizeof(ib_uverbs_query_device)) ? I'm thinking the best way forward might be to use a script and transform userspace into: struct ibv_query_device { struct ib_uverbs_cmd_hdr hdr; struct ib_uverbs_query_device cmd; }; That would break the users of the interface. Sorry, I mean doing this inside rdma-plumbing. Since the change is ABI identical the modified libibverbs would still be binary compatible with all providers but not source compatible. Since all kernel supported providers are in rdma-plumbing we can add the '.cmd.' at the same time. The kernel uapi header would stay the same. However automatically generating the user ABI from the kernel one might still be a good idea in the long run. My preference would be to try and use the kernel headers directly. I thought the same, especially after realizing that they are almost copy/paste from the vendor *-abi.h files. Jason
Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
On 14/09/2016 10:06, Parav Pandit wrote: Hi Dennis, Do you know how would HFI1 driver would work along with rdma cgroup? Hi Matan, Leon, Jason, Apart from HFI1, is there any other concern? I just wonder how things like RSS will work. For example, a RSS QP doesn't really have a queue (if I recall, it's connected to work queues via an indirection table). So, when a user creates such a QP, do you want to account it as a regular QP? How are work queues accounted? Or Patch is good to go? 4.8 dates are close by (2 weeks) and there are two git trees involved (that might cause merge error to Linus) so if there are no issues, I would like to make request to Doug to consider it for 4.8 early on. Parav On Mon, Sep 12, 2016 at 10:37 AM, Leon Romanovsky wrote: On Sun, Sep 11, 2016 at 11:52:35AM -0600, Jason Gunthorpe wrote: On Sun, Sep 11, 2016 at 07:24:45PM +0200, Christoph Hellwig wrote: I've posted some initial work toward a) a while ago, and once we Did it get merged? Do you have a pointer? http://www.spinics.net/lists/linux-rdma/msg31958.html Right, I remember that. Certainly the right direction However, everything under verbs is not straightforward. The files in userspace are not copies... user: struct ibv_query_device { __u32 command; __u16 in_words; __u16 out_words; __u64 response; __u64 driver_data[0]; }; kernel: struct ib_uverbs_query_device { __u64 response; __u64 driver_data[0]; }; We'll obviously need different strutures for the libibvers API and the kernel interface in this case, and we'll need to figure out how to properly translate them. I think a cast, plus compile time type checking ala BUILD_BUG_ON is the way to go. I'm not sure I follow, which would I cast? BUILD_BUG_ON(sizeof(ibv_query_device) == sizeof(ib_uverbs_cmd_hdr) + sizeof(ib_uverbs_query_device)) ? I'm thinking the best way forward might be to use a script and transform userspace into: struct ibv_query_device { struct ib_uverbs_cmd_hdr hdr; struct ib_uverbs_query_device cmd; }; That would break the users of the interface. Sorry, I mean doing this inside rdma-plumbing. Since the change is ABI identical the modified libibverbs would still be binary compatible with all providers but not source compatible. Since all kernel supported providers are in rdma-plumbing we can add the '.cmd.' at the same time. The kernel uapi header would stay the same. However automatically generating the user ABI from the kernel one might still be a good idea in the long run. My preference would be to try and use the kernel headers directly. I thought the same, especially after realizing that they are almost copy/paste from the vendor *-abi.h files. Jason
Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
On 10/09/2016 20:01, Jason Gunthorpe wrote: On Sat, Sep 10, 2016 at 06:14:42PM +0200, Christoph Hellwig wrote: OFVWG meetings have absolutely zero relevance for Linux development. Well, to be fair there are a fair number of kernel developers on that particular call.. More "flexibility" for drivers just means giving up on designing a coherent API and leaving it to drivers authors to add crap to their own drivers. That's a major step backwards. Sadly, it isn't a step backwards, it is status quo - at least as far as the uapi is concerned. Every single user space driver has its own private abi file, carefully hidden in their driver, and dutifully copied over to user space: providers/cxgb3/iwch-abi.h providers/cxgb4/cxgb4-abi.h providers/hfi1verbs/hfi-abi.h providers/i40iw/i40iw-abi.h providers/ipathverbs/ipath-abi.h providers/mlx4/mlx4-abi.h providers/mlx5/mlx5-abi.h providers/mthca/mthca-abi.h providers/nes/nes-abi.h providers/ocrdma/ocrdma_abi.h providers/rxe/rxe-abi.h Just to pick two random examples: struct mlx5_create_cq { struct ibv_create_cqibv_cmd; __u64 buf_addr; __u64 db_addr; __u32 cqe_size; }; struct iwch_create_cq { struct ibv_create_cq ibv_cmd; uint64_t user_rptr_addr; }; Love to hear ideas on a way forward that doesn't involve rewriting everything :( Yeah, unfortunately, the RDMA ABI is more driver specific ABI than a common user-kernel ABI. I guess this will become even worse, as the RDMA subsystem is evolving to serve more drivers with different object types. For example, I would like to hear how hfi1 are going to define their user-kernel ABI (once they leave the custom ioctls). They should not be using the code in drivers/infiniband. usnic is such an example of a driver that should never have been added in it's current form. +1 Jason Matan
Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
On 10/09/2016 20:01, Jason Gunthorpe wrote: On Sat, Sep 10, 2016 at 06:14:42PM +0200, Christoph Hellwig wrote: OFVWG meetings have absolutely zero relevance for Linux development. Well, to be fair there are a fair number of kernel developers on that particular call.. More "flexibility" for drivers just means giving up on designing a coherent API and leaving it to drivers authors to add crap to their own drivers. That's a major step backwards. Sadly, it isn't a step backwards, it is status quo - at least as far as the uapi is concerned. Every single user space driver has its own private abi file, carefully hidden in their driver, and dutifully copied over to user space: providers/cxgb3/iwch-abi.h providers/cxgb4/cxgb4-abi.h providers/hfi1verbs/hfi-abi.h providers/i40iw/i40iw-abi.h providers/ipathverbs/ipath-abi.h providers/mlx4/mlx4-abi.h providers/mlx5/mlx5-abi.h providers/mthca/mthca-abi.h providers/nes/nes-abi.h providers/ocrdma/ocrdma_abi.h providers/rxe/rxe-abi.h Just to pick two random examples: struct mlx5_create_cq { struct ibv_create_cqibv_cmd; __u64 buf_addr; __u64 db_addr; __u32 cqe_size; }; struct iwch_create_cq { struct ibv_create_cq ibv_cmd; uint64_t user_rptr_addr; }; Love to hear ideas on a way forward that doesn't involve rewriting everything :( Yeah, unfortunately, the RDMA ABI is more driver specific ABI than a common user-kernel ABI. I guess this will become even worse, as the RDMA subsystem is evolving to serve more drivers with different object types. For example, I would like to hear how hfi1 are going to define their user-kernel ABI (once they leave the custom ioctls). They should not be using the code in drivers/infiniband. usnic is such an example of a driver that should never have been added in it's current form. +1 Jason Matan
Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
On 10/09/2016 19:12, Christoph Hellwig wrote: On Wed, Sep 07, 2016 at 01:25:13PM +0530, Parav Pandit wrote: a) delay cgroups support until the grand rewrite is done b) add it now and deal with the consequences later Can we do (b) now and differ adding any HW resources to cgroup until they are clearly called out. Architecture and APIs are already in place to support this. Sounds fine to me. The only thing I want to avoid is pie in the sky "future proofing" that leads to horrible architectures. And I assume that's what Matan proposed. NO, that not what I proposed. User-kernel API/ABI should be designed with drivers differences in mind. The internal design or internals APIs could ignore such things as they can be changed later.
Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
On 10/09/2016 19:12, Christoph Hellwig wrote: On Wed, Sep 07, 2016 at 01:25:13PM +0530, Parav Pandit wrote: a) delay cgroups support until the grand rewrite is done b) add it now and deal with the consequences later Can we do (b) now and differ adding any HW resources to cgroup until they are clearly called out. Architecture and APIs are already in place to support this. Sounds fine to me. The only thing I want to avoid is pie in the sky "future proofing" that leads to horrible architectures. And I assume that's what Matan proposed. NO, that not what I proposed. User-kernel API/ABI should be designed with drivers differences in mind. The internal design or internals APIs could ignore such things as they can be changed later.
Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
On 07/09/2016 10:55, Parav Pandit wrote: Hi Matan, On Thu, Sep 1, 2016 at 2:14 PM, Christoph Hellwig <h...@lst.de> wrote: On Thu, Sep 01, 2016 at 10:25:40AM +0300, Matan Barak wrote: Well, if I recall, the reason doing so last time was in order to allow flexible updating of ib_core independently, which is obviously not a good reason (to say the least). Since the new ABI will probably define new object types (all recent proposals go this way), the current approach could lead to either trying to map new objects to existing cgroup resource types, which could lead to some weird non 1:1 mapping, or having a split definitions - such that each driver will declare its objects both in the cgroups mechanism and in its driver dispatch table. Even worse than that, drivers could simply ignore the cgroups support while implementing their own resource types and get a very broken containers support. If drivers are broken due to ignorance of not-calling cgroup APIs, that should be ok. That particular driver should fix it. If the resource creation using uverbs is using well defined rdma level resource, uverbs level will make sure to honor cgroup limits without involving hw drivers anyway. All recent proposals of the new ABI schema deals with extending the flexibility of the current schema by letting drivers define their specific types, actions, attributes, etc. Even more than that, the dispatching starts from the driver and it chooses if it wants to use the common RDMA core layer or have it's own wise implementation instead. Some drivers might even prefer not to implement the current verbs types. These decisions were made in the OFVWG meetings. Anyway, maybe we should consider using a more higher-level logic objects that could fit multiple drivers requirements. RDMA Verb level resource is charged/uncharged by RDMA core. RDMA HW level resource is charged/uncharged by RDMA HW driver using well defined API and resource by cgroup core. This scheme ensures that there is 1:1 mapping. Sounds reasonable, but what about drivers which ignore the common code and implement it in their own way? What about drivers which don't support the standard RDMA types at all? I guess we should find a balance between something abstract and common enough that will ease administrator configuration but be specific enough for the various models we have (or will have) in the RDMA stack. I don't think current definition of resource type is carved out on stone. They can be extended as we forward. As many of us agree that, they should be well defined and it should be agreed by cgroup and rdma community. Of course, but since the ABI and cgroups model are somehow related, they should be dealt with together and approved by Doug who participated in some of the OFVWG meetings. For example, today we have RDMA_VERB_xxx resources. New well defined RDMA HW resources can be defined in rdma_cgroup.h file as RDMA_HW_xx in same enum table. So a driver will change the cgroups file for every new type it adds? Will we just have a super set enum of all crazy types vendors added? Sorry guys, but arbitrary extensibility for something not finished is the worst idea ever. We have two options here: a) delay cgroups support until the grand rewrite is done b) add it now and deal with the consequences later Can we do (b) now and differ adding any HW resources to cgroup until they are clearly called out. Architecture and APIs are already in place to support this. Since this affect the user, it's better to think how it fits our "optional standard"/"vendor types" model first. Maybe we could force all standard types even if the driver we use doesn't support any of them. That being said, adding random non-Verbs hardwasre to the RDMA core is the second worst idea ever. We can differ adding HW resource to core and cgroup until they are clearly defined. In that case current architecture still holds good. Clearly we should differ adding the actual code until a driver could declare such objects, but we need to decide how to expose the standard optional RDMA types (basically, the types you've added) and how future driver specific types fit in. Guess I need to catch up with the discussion and start using the flame thrower. Matan, Can you please point us to the new RFC/ABI email thread which describes the design, partitioning of code between core vs hw drivers etc. One proposal is [1]. There's another one from Sean which aims for similar targets regards the driver specific types. [1] https://www.spinics.net/lists/linux-rdma/msg38997.html Matan
Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
On 07/09/2016 10:55, Parav Pandit wrote: Hi Matan, On Thu, Sep 1, 2016 at 2:14 PM, Christoph Hellwig wrote: On Thu, Sep 01, 2016 at 10:25:40AM +0300, Matan Barak wrote: Well, if I recall, the reason doing so last time was in order to allow flexible updating of ib_core independently, which is obviously not a good reason (to say the least). Since the new ABI will probably define new object types (all recent proposals go this way), the current approach could lead to either trying to map new objects to existing cgroup resource types, which could lead to some weird non 1:1 mapping, or having a split definitions - such that each driver will declare its objects both in the cgroups mechanism and in its driver dispatch table. Even worse than that, drivers could simply ignore the cgroups support while implementing their own resource types and get a very broken containers support. If drivers are broken due to ignorance of not-calling cgroup APIs, that should be ok. That particular driver should fix it. If the resource creation using uverbs is using well defined rdma level resource, uverbs level will make sure to honor cgroup limits without involving hw drivers anyway. All recent proposals of the new ABI schema deals with extending the flexibility of the current schema by letting drivers define their specific types, actions, attributes, etc. Even more than that, the dispatching starts from the driver and it chooses if it wants to use the common RDMA core layer or have it's own wise implementation instead. Some drivers might even prefer not to implement the current verbs types. These decisions were made in the OFVWG meetings. Anyway, maybe we should consider using a more higher-level logic objects that could fit multiple drivers requirements. RDMA Verb level resource is charged/uncharged by RDMA core. RDMA HW level resource is charged/uncharged by RDMA HW driver using well defined API and resource by cgroup core. This scheme ensures that there is 1:1 mapping. Sounds reasonable, but what about drivers which ignore the common code and implement it in their own way? What about drivers which don't support the standard RDMA types at all? I guess we should find a balance between something abstract and common enough that will ease administrator configuration but be specific enough for the various models we have (or will have) in the RDMA stack. I don't think current definition of resource type is carved out on stone. They can be extended as we forward. As many of us agree that, they should be well defined and it should be agreed by cgroup and rdma community. Of course, but since the ABI and cgroups model are somehow related, they should be dealt with together and approved by Doug who participated in some of the OFVWG meetings. For example, today we have RDMA_VERB_xxx resources. New well defined RDMA HW resources can be defined in rdma_cgroup.h file as RDMA_HW_xx in same enum table. So a driver will change the cgroups file for every new type it adds? Will we just have a super set enum of all crazy types vendors added? Sorry guys, but arbitrary extensibility for something not finished is the worst idea ever. We have two options here: a) delay cgroups support until the grand rewrite is done b) add it now and deal with the consequences later Can we do (b) now and differ adding any HW resources to cgroup until they are clearly called out. Architecture and APIs are already in place to support this. Since this affect the user, it's better to think how it fits our "optional standard"/"vendor types" model first. Maybe we could force all standard types even if the driver we use doesn't support any of them. That being said, adding random non-Verbs hardwasre to the RDMA core is the second worst idea ever. We can differ adding HW resource to core and cgroup until they are clearly defined. In that case current architecture still holds good. Clearly we should differ adding the actual code until a driver could declare such objects, but we need to decide how to expose the standard optional RDMA types (basically, the types you've added) and how future driver specific types fit in. Guess I need to catch up with the discussion and start using the flame thrower. Matan, Can you please point us to the new RFC/ABI email thread which describes the design, partitioning of code between core vs hw drivers etc. One proposal is [1]. There's another one from Sean which aims for similar targets regards the driver specific types. [1] https://www.spinics.net/lists/linux-rdma/msg38997.html Matan
Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
On 01/09/2016 00:16, Tejun Heo wrote: Hello, On Wed, Aug 31, 2016 at 06:07:30PM +0300, Matan Barak wrote: Currently, there are some discussions regarding the RDMA ABI. The current proposed approach (after a lot of discussions in the OFVWG) is to have driver dependent object types rather than the fixed set of IB object types we have today. AFAIK, some vendors might want to use the RDMA subsystem for a different fabrics which has a different set of objects. You could see RFCs for such concepts both from Mellanox and Intel on the linux-rdma mailing list. Saying that, maybe we need to make the resource types a bit more flexible and dynamic. That'd be back to square one and Christoph was dead against it too, so... Well, if I recall, the reason doing so last time was in order to allow flexible updating of ib_core independently, which is obviously not a good reason (to say the least). Since the new ABI will probably define new object types (all recent proposals go this way), the current approach could lead to either trying to map new objects to existing cgroup resource types, which could lead to some weird non 1:1 mapping, or having a split definitions - such that each driver will declare its objects both in the cgroups mechanism and in its driver dispatch table. Even worse than that, drivers could simply ignore the cgroups support while implementing their own resource types and get a very broken containers support. Thanks. Matan
Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
On 01/09/2016 00:16, Tejun Heo wrote: Hello, On Wed, Aug 31, 2016 at 06:07:30PM +0300, Matan Barak wrote: Currently, there are some discussions regarding the RDMA ABI. The current proposed approach (after a lot of discussions in the OFVWG) is to have driver dependent object types rather than the fixed set of IB object types we have today. AFAIK, some vendors might want to use the RDMA subsystem for a different fabrics which has a different set of objects. You could see RFCs for such concepts both from Mellanox and Intel on the linux-rdma mailing list. Saying that, maybe we need to make the resource types a bit more flexible and dynamic. That'd be back to square one and Christoph was dead against it too, so... Well, if I recall, the reason doing so last time was in order to allow flexible updating of ib_core independently, which is obviously not a good reason (to say the least). Since the new ABI will probably define new object types (all recent proposals go this way), the current approach could lead to either trying to map new objects to existing cgroup resource types, which could lead to some weird non 1:1 mapping, or having a split definitions - such that each driver will declare its objects both in the cgroups mechanism and in its driver dispatch table. Even worse than that, drivers could simply ignore the cgroups support while implementing their own resource types and get a very broken containers support. Thanks. Matan
Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
On 31/08/2016 11:37, Parav Pandit wrote: Added rdma cgroup controller that does accounting, limit enforcement on rdma/IB verbs and hw resources. Added rdma cgroup header file which defines its APIs to perform charing/uncharing functionality. It also defined APIs for RDMA/IB stack for device registration. Devices which are registered will participate in controller functions of accounting and limit enforcements. It define rdmacg_device structure to bind IB stack and RDMA cgroup controller. RDMA resources are tracked using resource pool. Resource pool is per device, per cgroup entity which allows setting up accounting limits on per device basis. Currently resources are defined by the RDMA cgroup. Resource pool is created/destroyed dynamically whenever charging/uncharging occurs respectively and whenever user configuration is done. Its a tradeoff of memory vs little more code space that creates resource pool object whenever necessary, instead of creating them during cgroup creation and device registration time. Signed-off-by: Parav Pandit--- include/linux/cgroup_rdma.h | 66 + include/linux/cgroup_subsys.h | 4 + init/Kconfig | 10 + kernel/Makefile | 1 + kernel/cgroup_rdma.c | 664 ++ 5 files changed, 745 insertions(+) create mode 100644 include/linux/cgroup_rdma.h create mode 100644 kernel/cgroup_rdma.c diff --git a/include/linux/cgroup_rdma.h b/include/linux/cgroup_rdma.h new file mode 100644 index 000..6710e28 --- /dev/null +++ b/include/linux/cgroup_rdma.h @@ -0,0 +1,66 @@ +/* + * Copyright (C) 2016 Parav Pandit + * + * This file is subject to the terms and conditions of version 2 of the GNU + * General Public License. See the file COPYING in the main directory of the + * Linux distribution for more details. + */ + +#ifndef _CGROUP_RDMA_H +#define _CGROUP_RDMA_H + +#include + +enum rdmacg_resource_type { + RDMACG_VERB_RESOURCE_UCTX, + RDMACG_VERB_RESOURCE_AH, + RDMACG_VERB_RESOURCE_PD, + RDMACG_VERB_RESOURCE_CQ, + RDMACG_VERB_RESOURCE_MR, + RDMACG_VERB_RESOURCE_MW, + RDMACG_VERB_RESOURCE_SRQ, + RDMACG_VERB_RESOURCE_QP, + RDMACG_VERB_RESOURCE_FLOW, + /* +* add any hw specific resource here as RDMA_HW_RESOURCE_NAME +*/ + RDMACG_RESOURCE_MAX, +}; + +#ifdef CONFIG_CGROUP_RDMA + Currently, there are some discussions regarding the RDMA ABI. The current proposed approach (after a lot of discussions in the OFVWG) is to have driver dependent object types rather than the fixed set of IB object types we have today. AFAIK, some vendors might want to use the RDMA subsystem for a different fabrics which has a different set of objects. You could see RFCs for such concepts both from Mellanox and Intel on the linux-rdma mailing list. Saying that, maybe we need to make the resource types a bit more flexible and dynamic. Regards, Matan
Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
On 31/08/2016 11:37, Parav Pandit wrote: Added rdma cgroup controller that does accounting, limit enforcement on rdma/IB verbs and hw resources. Added rdma cgroup header file which defines its APIs to perform charing/uncharing functionality. It also defined APIs for RDMA/IB stack for device registration. Devices which are registered will participate in controller functions of accounting and limit enforcements. It define rdmacg_device structure to bind IB stack and RDMA cgroup controller. RDMA resources are tracked using resource pool. Resource pool is per device, per cgroup entity which allows setting up accounting limits on per device basis. Currently resources are defined by the RDMA cgroup. Resource pool is created/destroyed dynamically whenever charging/uncharging occurs respectively and whenever user configuration is done. Its a tradeoff of memory vs little more code space that creates resource pool object whenever necessary, instead of creating them during cgroup creation and device registration time. Signed-off-by: Parav Pandit --- include/linux/cgroup_rdma.h | 66 + include/linux/cgroup_subsys.h | 4 + init/Kconfig | 10 + kernel/Makefile | 1 + kernel/cgroup_rdma.c | 664 ++ 5 files changed, 745 insertions(+) create mode 100644 include/linux/cgroup_rdma.h create mode 100644 kernel/cgroup_rdma.c diff --git a/include/linux/cgroup_rdma.h b/include/linux/cgroup_rdma.h new file mode 100644 index 000..6710e28 --- /dev/null +++ b/include/linux/cgroup_rdma.h @@ -0,0 +1,66 @@ +/* + * Copyright (C) 2016 Parav Pandit + * + * This file is subject to the terms and conditions of version 2 of the GNU + * General Public License. See the file COPYING in the main directory of the + * Linux distribution for more details. + */ + +#ifndef _CGROUP_RDMA_H +#define _CGROUP_RDMA_H + +#include + +enum rdmacg_resource_type { + RDMACG_VERB_RESOURCE_UCTX, + RDMACG_VERB_RESOURCE_AH, + RDMACG_VERB_RESOURCE_PD, + RDMACG_VERB_RESOURCE_CQ, + RDMACG_VERB_RESOURCE_MR, + RDMACG_VERB_RESOURCE_MW, + RDMACG_VERB_RESOURCE_SRQ, + RDMACG_VERB_RESOURCE_QP, + RDMACG_VERB_RESOURCE_FLOW, + /* +* add any hw specific resource here as RDMA_HW_RESOURCE_NAME +*/ + RDMACG_RESOURCE_MAX, +}; + +#ifdef CONFIG_CGROUP_RDMA + Currently, there are some discussions regarding the RDMA ABI. The current proposed approach (after a lot of discussions in the OFVWG) is to have driver dependent object types rather than the fixed set of IB object types we have today. AFAIK, some vendors might want to use the RDMA subsystem for a different fabrics which has a different set of objects. You could see RFCs for such concepts both from Mellanox and Intel on the linux-rdma mailing list. Saying that, maybe we need to make the resource types a bit more flexible and dynamic. Regards, Matan
Re: [BUG] mellanox IB driver fails to load on large config
On Tue, Jul 21, 2015 at 5:56 AM, Alex Thorlton wrote: > On Mon, Jul 20, 2015 at 11:28:03AM -0500, Alex Thorlton wrote: >> I've got some time on the large machine later today. I'll give this a >> try then. > > I ran a boot with this patch applied: > > diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h > index 83e80ab..c84aea0 100644 > --- a/include/linux/mlx4/device.h > +++ b/include/linux/mlx4/device.h > @@ -45,7 +45,7 @@ > #include > > #define MAX_MSIX_P_PORT17 > -#define MAX_MSIX 64 > +#define MAX_MSIX 8192 > #define MSIX_LEGACY_SZ 4 > #define MIN_MSIX_P_PORT5 > > I went for a max of 8192, since I was actually booting the machine with > 6144 cores (not 4096) for this run. It doesn't look like this fixed the > problem. I still saw the same errors during boot. > > FWIW, the module does appear to still successfully load: > > 8<--- > # lsmod | grep mlx > mlx4_ib 151552 0 > ib_sa 32768 1 mlx4_ib > ib_mad 49152 2 ib_sa,mlx4_ib > ib_core 102400 3 ib_sa,mlx4_ib,ib_mad > mlx4_core 278528 1 mlx4_ib > --->8 > > If the module loading is good enough, and we should just ignore the > errors, then I'm fine with that. Just wanting to make sure that > everything is behaving correctly. It shouldn't be a problem, as all unused/erroneous EQs get "-1". We'll try to reproduce the problem here, it might take awhile though. Thanks for checking this, Matan > > - Alex > -- > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [BUG] mellanox IB driver fails to load on large config
On Tue, Jul 21, 2015 at 5:56 AM, Alex Thorlton athorl...@sgi.com wrote: On Mon, Jul 20, 2015 at 11:28:03AM -0500, Alex Thorlton wrote: I've got some time on the large machine later today. I'll give this a try then. I ran a boot with this patch applied: diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h index 83e80ab..c84aea0 100644 --- a/include/linux/mlx4/device.h +++ b/include/linux/mlx4/device.h @@ -45,7 +45,7 @@ #include linux/timecounter.h #define MAX_MSIX_P_PORT17 -#define MAX_MSIX 64 +#define MAX_MSIX 8192 #define MSIX_LEGACY_SZ 4 #define MIN_MSIX_P_PORT5 I went for a max of 8192, since I was actually booting the machine with 6144 cores (not 4096) for this run. It doesn't look like this fixed the problem. I still saw the same errors during boot. FWIW, the module does appear to still successfully load: 8--- # lsmod | grep mlx mlx4_ib 151552 0 ib_sa 32768 1 mlx4_ib ib_mad 49152 2 ib_sa,mlx4_ib ib_core 102400 3 ib_sa,mlx4_ib,ib_mad mlx4_core 278528 1 mlx4_ib ---8 If the module loading is good enough, and we should just ignore the errors, then I'm fine with that. Just wanting to make sure that everything is behaving correctly. It shouldn't be a problem, as all unused/erroneous EQs get -1. We'll try to reproduce the problem here, it might take awhile though. Thanks for checking this, Matan - Alex -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [BUG] mellanox IB driver fails to load on large config
On 7/14/2015 11:28 PM, Alex Thorlton wrote: On Tue, Jul 14, 2015 at 11:06:26PM +0300, Or Gerlitz wrote: On Tue, Jul 14, 2015 at 9:48 PM, Alex Thorlton wrote: On Tue, Jul 14, 2015 at 01:22:34PM -0500, andrew banman wrote: On Sat, Jul 11, 2015 at 11:20:19PM +0300, Or Gerlitz wrote: On Fri, Jul 10, 2015 at 10:15 PM, andrew banman wrote: I'm seeing a large number of allocation errors originating from the Mellanox IB driver when booting the 4.2-rc1 kernel on a 4096cpu 32TB memory system: Just to make sure, mlx4 works fine on this small (...) system with 4.1 and 4.2-rc1 breaks, or 4.2-rc1 is the 1st time you're trying that config? I'll let Alex comment on that, he did some testing on that. I started seeing this on a 4.1-rc8 kernel, so it's been around for a little while. It may have been around before 4.1-rc8, but I haven't run any kernels older than that on the big machine for some time. To make sure I am correctly following, on 4.1-rc8 you also see something, right? Yes, that's correct. are these the same messages or different ones? if the latter send to us. We see the same exact messages on 4.1-rc8. Hi, We don't recall getting those error with 32cpu machines, but we'll try to reproduce this issue. Matan Thanks for looking into this! - Alex -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [BUG] mellanox IB driver fails to load on large config
On 7/14/2015 11:28 PM, Alex Thorlton wrote: On Tue, Jul 14, 2015 at 11:06:26PM +0300, Or Gerlitz wrote: On Tue, Jul 14, 2015 at 9:48 PM, Alex Thorlton athorl...@sgi.com wrote: On Tue, Jul 14, 2015 at 01:22:34PM -0500, andrew banman wrote: On Sat, Jul 11, 2015 at 11:20:19PM +0300, Or Gerlitz wrote: On Fri, Jul 10, 2015 at 10:15 PM, andrew banman aban...@sgi.com wrote: I'm seeing a large number of allocation errors originating from the Mellanox IB driver when booting the 4.2-rc1 kernel on a 4096cpu 32TB memory system: Just to make sure, mlx4 works fine on this small (...) system with 4.1 and 4.2-rc1 breaks, or 4.2-rc1 is the 1st time you're trying that config? I'll let Alex comment on that, he did some testing on that. I started seeing this on a 4.1-rc8 kernel, so it's been around for a little while. It may have been around before 4.1-rc8, but I haven't run any kernels older than that on the big machine for some time. To make sure I am correctly following, on 4.1-rc8 you also see something, right? Yes, that's correct. are these the same messages or different ones? if the latter send to us. We see the same exact messages on 4.1-rc8. Hi, We don't recall getting those error with 32cpu machines, but we'll try to reproduce this issue. Matan Thanks for looking into this! - Alex -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/