Re: [for-next 7/7] IB/mlx5: Implement fragmented completion queue (CQ)

2018-02-25 Thread Santosh Shilimkar

n 2/24/2018 1:40 AM, Majd Dibbiny wrote:



On Feb 23, 2018, at 9:13 PM, Saeed Mahameed  wrote:


On Thu, 2018-02-22 at 16:04 -0800, Santosh Shilimkar wrote:
Hi Saeed


On 2/21/2018 12:13 PM, Saeed Mahameed wrote:


[...]



Jason mentioned about this patch to me off-list. We were
seeing similar issue with SRQs & QPs. So wondering whether
you have any plans to do similar change for other resouces
too so that they don't rely on higher order page allocation
for icm tables.



Hi Santosh,

Adding Majd,

Which ULP is in question ? how big are the QPs/SRQs you create that
lead to this problem ?

For icm tables we already allocate only order 0 pages:
see alloc_system_page() in
drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c

But for kernel RDMA SRQ and QP buffers there is a place for
improvement.

Majd, do you know if we have any near future plans for this.


It’s in our plans to move all the buffers to use 0-order pages.

Santosh,

Is this RDS? Do you have persistent failure with some configuration? Can you 
please share more information?


No the issue seen with user verbs and actually MLX4 driver. My
last question was more for both MLX4 and MLX5 drivers icm
allocation for all the resources.

With MLX4 driver, we have seen corruption issues with MLX4_NO_RR
while recycling the issues. So we ended up switching to round robin
bitmap allocation as it was before which was changed by one of
Jacks commit 7c6d74d23 {mlx4_core: Roll back round robin bitmap
allocation commit for CQs, SRQs, and MPTs}

With default round robin, the corruption issue went away but then
its undesired effect of bloating the icm tables till you hit the
resource limit means more memory fragmentation. Since these resources
makes use of higher order allocations and in fragmented memory scenarios
we see contention on mm lock for seconds since compaction layer is
trying to stitch pages which takes time.

If these alloaction don't make use of higher order pages, the issue
can be certainly avoided and hence the reason behind the question.

Ofcourse we wouldn't have ended up with this issue if 'MLX4_NO_RR'
worked without corruption :-)

Regards,
Santosh










Re: [for-next 7/7] IB/mlx5: Implement fragmented completion queue (CQ)

2018-02-24 Thread Majd Dibbiny

> On Feb 23, 2018, at 9:13 PM, Saeed Mahameed  wrote:
> 
>> On Thu, 2018-02-22 at 16:04 -0800, Santosh Shilimkar wrote:
>> Hi Saeed
>> 
>>> On 2/21/2018 12:13 PM, Saeed Mahameed wrote:
>>> From: Yonatan Cohen 
>>> 
>>> The current implementation of create CQ requires contiguous
>>> memory, such requirement is problematic once the memory is
>>> fragmented or the system is low in memory, it causes for
>>> failures in dma_zalloc_coherent().
>>> 
>>> This patch implements new scheme of fragmented CQ to overcome
>>> this issue by introducing new type: 'struct mlx5_frag_buf_ctrl'
>>> to allocate fragmented buffers, rather than contiguous ones.
>>> 
>>> Base the Completion Queues (CQs) on this new fragmented buffer.
>>> 
>>> It fixes following crashes:
>>> kworker/29:0: page allocation failure: order:6, mode:0x80d0
>>> CPU: 29 PID: 8374 Comm: kworker/29:0 Tainted: G OE 3.10.0
>>> Workqueue: ib_cm cm_work_handler [ib_cm]
>>> Call Trace:
>>> [<>] dump_stack+0x19/0x1b
>>> [<>] warn_alloc_failed+0x110/0x180
>>> [<>] __alloc_pages_slowpath+0x6b7/0x725
>>> [<>] __alloc_pages_nodemask+0x405/0x420
>>> [<>] dma_generic_alloc_coherent+0x8f/0x140
>>> [<>] x86_swiotlb_alloc_coherent+0x21/0x50
>>> [<>] mlx5_dma_zalloc_coherent_node+0xad/0x110 [mlx5_core]
>>> [<>] ? mlx5_db_alloc_node+0x69/0x1b0 [mlx5_core]
>>> [<>] mlx5_buf_alloc_node+0x3e/0xa0 [mlx5_core]
>>> [<>] mlx5_buf_alloc+0x14/0x20 [mlx5_core]
>>> [<>] create_cq_kernel+0x90/0x1f0 [mlx5_ib]
>>> [<>] mlx5_ib_create_cq+0x3b0/0x4e0 [mlx5_ib]
>>> 
>>> Signed-off-by: Yonatan Cohen 
>>> Reviewed-by: Tariq Toukan 
>>> Signed-off-by: Leon Romanovsky 
>>> Signed-off-by: Saeed Mahameed 
>>> ---
>> 
>> Jason mentioned about this patch to me off-list. We were
>> seeing similar issue with SRQs & QPs. So wondering whether
>> you have any plans to do similar change for other resouces
>> too so that they don't rely on higher order page allocation
>> for icm tables.
>> 
> 
> Hi Santosh,
> 
> Adding Majd,
> 
> Which ULP is in question ? how big are the QPs/SRQs you create that
> lead to this problem ?
> 
> For icm tables we already allocate only order 0 pages:
> see alloc_system_page() in
> drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c
> 
> But for kernel RDMA SRQ and QP buffers there is a place for
> improvement.
> 
> Majd, do you know if we have any near future plans for this.

It’s in our plans to move all the buffers to use 0-order pages.

Santosh,

Is this RDS? Do you have persistent failure with some configuration? Can you 
please share more information?

Thanks
> 
>> Regards,
>> Santosh


Re: [for-next 7/7] IB/mlx5: Implement fragmented completion queue (CQ)

2018-02-23 Thread Saeed Mahameed
On Thu, 2018-02-22 at 16:04 -0800, Santosh Shilimkar wrote:
> Hi Saeed
> 
> On 2/21/2018 12:13 PM, Saeed Mahameed wrote:
> > From: Yonatan Cohen 
> > 
> > The current implementation of create CQ requires contiguous
> > memory, such requirement is problematic once the memory is
> > fragmented or the system is low in memory, it causes for
> > failures in dma_zalloc_coherent().
> > 
> > This patch implements new scheme of fragmented CQ to overcome
> > this issue by introducing new type: 'struct mlx5_frag_buf_ctrl'
> > to allocate fragmented buffers, rather than contiguous ones.
> > 
> > Base the Completion Queues (CQs) on this new fragmented buffer.
> > 
> > It fixes following crashes:
> > kworker/29:0: page allocation failure: order:6, mode:0x80d0
> > CPU: 29 PID: 8374 Comm: kworker/29:0 Tainted: G OE 3.10.0
> > Workqueue: ib_cm cm_work_handler [ib_cm]
> > Call Trace:
> > [<>] dump_stack+0x19/0x1b
> > [<>] warn_alloc_failed+0x110/0x180
> > [<>] __alloc_pages_slowpath+0x6b7/0x725
> > [<>] __alloc_pages_nodemask+0x405/0x420
> > [<>] dma_generic_alloc_coherent+0x8f/0x140
> > [<>] x86_swiotlb_alloc_coherent+0x21/0x50
> > [<>] mlx5_dma_zalloc_coherent_node+0xad/0x110 [mlx5_core]
> > [<>] ? mlx5_db_alloc_node+0x69/0x1b0 [mlx5_core]
> > [<>] mlx5_buf_alloc_node+0x3e/0xa0 [mlx5_core]
> > [<>] mlx5_buf_alloc+0x14/0x20 [mlx5_core]
> > [<>] create_cq_kernel+0x90/0x1f0 [mlx5_ib]
> > [<>] mlx5_ib_create_cq+0x3b0/0x4e0 [mlx5_ib]
> > 
> > Signed-off-by: Yonatan Cohen 
> > Reviewed-by: Tariq Toukan 
> > Signed-off-by: Leon Romanovsky 
> > Signed-off-by: Saeed Mahameed 
> > ---
> 
> Jason mentioned about this patch to me off-list. We were
> seeing similar issue with SRQs & QPs. So wondering whether
> you have any plans to do similar change for other resouces
> too so that they don't rely on higher order page allocation
> for icm tables.
> 

Hi Santosh,

Adding Majd,

Which ULP is in question ? how big are the QPs/SRQs you create that
lead to this problem ?

For icm tables we already allocate only order 0 pages:
see alloc_system_page() in
drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c

But for kernel RDMA SRQ and QP buffers there is a place for
improvement.

Majd, do you know if we have any near future plans for this.

> Regards,
> Santosh

Re: [for-next 7/7] IB/mlx5: Implement fragmented completion queue (CQ)

2018-02-22 Thread Santosh Shilimkar

Hi Saeed

On 2/21/2018 12:13 PM, Saeed Mahameed wrote:

From: Yonatan Cohen 

The current implementation of create CQ requires contiguous
memory, such requirement is problematic once the memory is
fragmented or the system is low in memory, it causes for
failures in dma_zalloc_coherent().

This patch implements new scheme of fragmented CQ to overcome
this issue by introducing new type: 'struct mlx5_frag_buf_ctrl'
to allocate fragmented buffers, rather than contiguous ones.

Base the Completion Queues (CQs) on this new fragmented buffer.

It fixes following crashes:
kworker/29:0: page allocation failure: order:6, mode:0x80d0
CPU: 29 PID: 8374 Comm: kworker/29:0 Tainted: G OE 3.10.0
Workqueue: ib_cm cm_work_handler [ib_cm]
Call Trace:
[<>] dump_stack+0x19/0x1b
[<>] warn_alloc_failed+0x110/0x180
[<>] __alloc_pages_slowpath+0x6b7/0x725
[<>] __alloc_pages_nodemask+0x405/0x420
[<>] dma_generic_alloc_coherent+0x8f/0x140
[<>] x86_swiotlb_alloc_coherent+0x21/0x50
[<>] mlx5_dma_zalloc_coherent_node+0xad/0x110 [mlx5_core]
[<>] ? mlx5_db_alloc_node+0x69/0x1b0 [mlx5_core]
[<>] mlx5_buf_alloc_node+0x3e/0xa0 [mlx5_core]
[<>] mlx5_buf_alloc+0x14/0x20 [mlx5_core]
[<>] create_cq_kernel+0x90/0x1f0 [mlx5_ib]
[<>] mlx5_ib_create_cq+0x3b0/0x4e0 [mlx5_ib]

Signed-off-by: Yonatan Cohen 
Reviewed-by: Tariq Toukan 
Signed-off-by: Leon Romanovsky 
Signed-off-by: Saeed Mahameed 
---

Jason mentioned about this patch to me off-list. We were
seeing similar issue with SRQs & QPs. So wondering whether
you have any plans to do similar change for other resouces
too so that they don't rely on higher order page allocation
for icm tables.

Regards,
Santosh


Re: [for-next 7/7] IB/mlx5: Implement fragmented completion queue (CQ)

2018-02-22 Thread Jason Gunthorpe
On Wed, Feb 21, 2018 at 12:13:54PM -0800, Saeed Mahameed wrote:
> From: Yonatan Cohen 
> 
> The current implementation of create CQ requires contiguous
> memory, such requirement is problematic once the memory is
> fragmented or the system is low in memory, it causes for
> failures in dma_zalloc_coherent().
> 
> This patch implements new scheme of fragmented CQ to overcome
> this issue by introducing new type: 'struct mlx5_frag_buf_ctrl'
> to allocate fragmented buffers, rather than contiguous ones.
> 
> Base the Completion Queues (CQs) on this new fragmented buffer.
> 
> It fixes following crashes:
> kworker/29:0: page allocation failure: order:6, mode:0x80d0
> CPU: 29 PID: 8374 Comm: kworker/29:0 Tainted: G OE 3.10.0
> Workqueue: ib_cm cm_work_handler [ib_cm]
> Call Trace:
> [<>] dump_stack+0x19/0x1b
> [<>] warn_alloc_failed+0x110/0x180
> [<>] __alloc_pages_slowpath+0x6b7/0x725
> [<>] __alloc_pages_nodemask+0x405/0x420
> [<>] dma_generic_alloc_coherent+0x8f/0x140
> [<>] x86_swiotlb_alloc_coherent+0x21/0x50
> [<>] mlx5_dma_zalloc_coherent_node+0xad/0x110 [mlx5_core]
> [<>] ? mlx5_db_alloc_node+0x69/0x1b0 [mlx5_core]
> [<>] mlx5_buf_alloc_node+0x3e/0xa0 [mlx5_core]
> [<>] mlx5_buf_alloc+0x14/0x20 [mlx5_core]
> [<>] create_cq_kernel+0x90/0x1f0 [mlx5_ib]
> [<>] mlx5_ib_create_cq+0x3b0/0x4e0 [mlx5_ib]
> 
> Signed-off-by: Yonatan Cohen 
> Reviewed-by: Tariq Toukan 
> Signed-off-by: Leon Romanovsky 
> Signed-off-by: Saeed Mahameed 
>  drivers/infiniband/hw/mlx5/cq.c | 64 
> +++--
>  drivers/infiniband/hw/mlx5/mlx5_ib.h|  6 +--
>  drivers/net/ethernet/mellanox/mlx5/core/alloc.c | 37 +-
>  drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 11 +++--
>  drivers/net/ethernet/mellanox/mlx5/core/wq.c| 18 +++
>  drivers/net/ethernet/mellanox/mlx5/core/wq.h| 22 +++--
>  include/linux/mlx5/driver.h | 51 ++--
>  7 files changed, 124 insertions(+), 85 deletions(-)

For the drivers/infiniband stuff:

Acked-by: Jason Gunthorpe 

Thanks,
Jason


[for-next 7/7] IB/mlx5: Implement fragmented completion queue (CQ)

2018-02-21 Thread Saeed Mahameed
From: Yonatan Cohen 

The current implementation of create CQ requires contiguous
memory, such requirement is problematic once the memory is
fragmented or the system is low in memory, it causes for
failures in dma_zalloc_coherent().

This patch implements new scheme of fragmented CQ to overcome
this issue by introducing new type: 'struct mlx5_frag_buf_ctrl'
to allocate fragmented buffers, rather than contiguous ones.

Base the Completion Queues (CQs) on this new fragmented buffer.

It fixes following crashes:
kworker/29:0: page allocation failure: order:6, mode:0x80d0
CPU: 29 PID: 8374 Comm: kworker/29:0 Tainted: G OE 3.10.0
Workqueue: ib_cm cm_work_handler [ib_cm]
Call Trace:
[<>] dump_stack+0x19/0x1b
[<>] warn_alloc_failed+0x110/0x180
[<>] __alloc_pages_slowpath+0x6b7/0x725
[<>] __alloc_pages_nodemask+0x405/0x420
[<>] dma_generic_alloc_coherent+0x8f/0x140
[<>] x86_swiotlb_alloc_coherent+0x21/0x50
[<>] mlx5_dma_zalloc_coherent_node+0xad/0x110 [mlx5_core]
[<>] ? mlx5_db_alloc_node+0x69/0x1b0 [mlx5_core]
[<>] mlx5_buf_alloc_node+0x3e/0xa0 [mlx5_core]
[<>] mlx5_buf_alloc+0x14/0x20 [mlx5_core]
[<>] create_cq_kernel+0x90/0x1f0 [mlx5_ib]
[<>] mlx5_ib_create_cq+0x3b0/0x4e0 [mlx5_ib]

Signed-off-by: Yonatan Cohen 
Reviewed-by: Tariq Toukan 
Signed-off-by: Leon Romanovsky 
Signed-off-by: Saeed Mahameed 
---
 drivers/infiniband/hw/mlx5/cq.c | 64 +++--
 drivers/infiniband/hw/mlx5/mlx5_ib.h|  6 +--
 drivers/net/ethernet/mellanox/mlx5/core/alloc.c | 37 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 11 +++--
 drivers/net/ethernet/mellanox/mlx5/core/wq.c| 18 +++
 drivers/net/ethernet/mellanox/mlx5/core/wq.h| 22 +++--
 include/linux/mlx5/driver.h | 51 ++--
 7 files changed, 124 insertions(+), 85 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/cq.c b/drivers/infiniband/hw/mlx5/cq.c
index 5b974fb97611..c4c7b82f4ac1 100644
--- a/drivers/infiniband/hw/mlx5/cq.c
+++ b/drivers/infiniband/hw/mlx5/cq.c
@@ -64,14 +64,9 @@ static void mlx5_ib_cq_event(struct mlx5_core_cq *mcq, enum 
mlx5_event type)
}
 }
 
-static void *get_cqe_from_buf(struct mlx5_ib_cq_buf *buf, int n, int size)
-{
-   return mlx5_buf_offset(&buf->buf, n * size);
-}
-
 static void *get_cqe(struct mlx5_ib_cq *cq, int n)
 {
-   return get_cqe_from_buf(&cq->buf, n, cq->mcq.cqe_sz);
+   return mlx5_frag_buf_get_wqe(&cq->buf.fbc, n);
 }
 
 static u8 sw_ownership_bit(int n, int nent)
@@ -403,7 +398,7 @@ static void handle_atomics(struct mlx5_ib_qp *qp, struct 
mlx5_cqe64 *cqe64,
 
 static void free_cq_buf(struct mlx5_ib_dev *dev, struct mlx5_ib_cq_buf *buf)
 {
-   mlx5_buf_free(dev->mdev, &buf->buf);
+   mlx5_frag_buf_free(dev->mdev, &buf->fbc.frag_buf);
 }
 
 static void get_sig_err_item(struct mlx5_sig_err_cqe *cqe,
@@ -724,12 +719,25 @@ int mlx5_ib_arm_cq(struct ib_cq *ibcq, enum 
ib_cq_notify_flags flags)
return ret;
 }
 
-static int alloc_cq_buf(struct mlx5_ib_dev *dev, struct mlx5_ib_cq_buf *buf,
-   int nent, int cqe_size)
+static int alloc_cq_frag_buf(struct mlx5_ib_dev *dev,
+struct mlx5_ib_cq_buf *buf,
+int nent,
+int cqe_size)
 {
+   struct mlx5_frag_buf_ctrl *c = &buf->fbc;
+   struct mlx5_frag_buf *frag_buf = &c->frag_buf;
+   u32 cqc_buff[MLX5_ST_SZ_DW(cqc)] = {0};
int err;
 
-   err = mlx5_buf_alloc(dev->mdev, nent * cqe_size, &buf->buf);
+   MLX5_SET(cqc, cqc_buff, log_cq_size, ilog2(cqe_size));
+   MLX5_SET(cqc, cqc_buff, cqe_sz, (cqe_size == 128) ? 1 : 0);
+
+   mlx5_core_init_cq_frag_buf(&buf->fbc, cqc_buff);
+
+   err = mlx5_frag_buf_alloc_node(dev->mdev,
+  nent * cqe_size,
+  frag_buf,
+  dev->mdev->priv.numa_node);
if (err)
return err;
 
@@ -862,14 +870,15 @@ static void destroy_cq_user(struct mlx5_ib_cq *cq, struct 
ib_ucontext *context)
ib_umem_release(cq->buf.umem);
 }
 
-static void init_cq_buf(struct mlx5_ib_cq *cq, struct mlx5_ib_cq_buf *buf)
+static void init_cq_frag_buf(struct mlx5_ib_cq *cq,
+struct mlx5_ib_cq_buf *buf)
 {
int i;
void *cqe;
struct mlx5_cqe64 *cqe64;
 
for (i = 0; i < buf->nent; i++) {
-   cqe = get_cqe_from_buf(buf, i, buf->cqe_size);
+   cqe = get_cqe(cq, i);
cqe64 = buf->cqe_size == 64 ? cqe : cqe + 64;
cqe64->op_own = MLX5_CQE_INVALID << 4;
}
@@ -891,14 +900,15 @@ static int create_cq_kernel(struct mlx5_ib_dev *dev, 
struct mlx5_ib_cq *cq,
cq->mcq.arm_db = cq->db.db + 1;
cq->mcq.cqe_sz = cqe_size;
 
-   err = alloc_cq_buf(dev, &cq->buf, entries, cqe_size);
+   err = alloc_cq_frag_buf(dev, &cq->buf, entries, cqe_size);
if (e