Yes, I think this is a potential issue with this patch. For each 1M of data,
Lustre has 256 fragments (256 pages) on a system with a 4K page size, which
means we can have up to (credits x 256) outstanding work requests per
connection. Decreasing max_send_wr may therefore lead to ib_post_send()
failures under heavy workload.
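
To make the arithmetic concrete, here is a minimal sketch of the bound
described above. The function names and the credit count used in the usage
note are hypothetical and chosen only for illustration; they are not taken
from the o2iblnd source.

```c
#include <assert.h>

/* Number of page-sized fragments needed to describe one transfer.
 * Rounds up so that a partial trailing page still counts as a fragment. */
static unsigned int frags_per_transfer(unsigned int bytes, unsigned int page_size)
{
        assert(page_size != 0);
        return (bytes + page_size - 1) / page_size;
}

/* Upper bound on outstanding send work requests for one connection,
 * assuming every credit can carry a fully fragmented transfer. */
static unsigned int max_outstanding_wrs(unsigned int credits, unsigned int frags)
{
        return credits * frags;
}
```

With a 1M transfer and 4K pages, frags_per_transfer(1024 * 1024, 4096) gives
256, and with an assumed 8 credits the bound is 8 x 256 = 2048 work requests.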

I understand that allocating a big chunk of space may be a problem for the
low-level stack and can cause memory allocation failures. The solution is to
enable map_on_demand and use FMR. However, enabling this on some nodes will
prevent them from joining the cluster if other nodes do not have
map_on_demand enabled. We already have a patch for this, which is pending
review; please see LU-3322.

Thanks
Liang

From: David McMillen <[email protected]>
Date: Sunday, August 31, 2014 at 6:48 PM
To: "[email protected]" <[email protected]>,
Eli Cohen <[email protected]>
Subject: Re: [Lustre-discuss] [PATCH] Avoid Lustre failure on temporary failure

Has this been tested with a significant I/O load?  We had tried a similar 
approach but ran into subsequent errors and connection drops when the 
ib_post_send() failed.  The code assumes that the original 
init_qp_attr->cap.max_send_wr value succeeded.  Is there a second part to this 
patch?

Dave

On Sun, Aug 31, 2014 at 2:53 AM, Eli Cohen <[email protected]> wrote:

> Lustre code tries to create a QP with max_send_wr which depends on a module
> parameter.  The device capabilities do provide the maximum number of send work
> requests that the device supports but the actual number of work requests that
> can be supported in a specific case depends on other characteristics of the
> work queue, the transport type, etc. This is in compliance with the IB spec:
>
> 11.2.1.2 QUERY HCA
> Description:
> Returns the attributes for the specified HCA.
> The maximum values defined in this section are guaranteed
> not-to-exceed values. It is possible for an implementation to allocate
> some HCA resources from the same space. In that case, the maximum
> values returned are not guaranteed for all of those resources
> simultaneously.
>
> This patch tries to decrease the number of requested work requests to a level
> that can be supported by the HCA. This prevents unnecessary failures.
>
> Signed-off-by: Eli Cohen <eli at mellanox.com>
> ---
>  lnet/klnds/o2iblnd/o2iblnd.c | 25 ++++++++++++++++++-------
>  1 file changed, 18 insertions(+), 7 deletions(-)
>
> diff --git a/lnet/klnds/o2iblnd/o2iblnd.c b/lnet/klnds/o2iblnd/o2iblnd.c
> index 4061db00cba2..ef1c6e07cb45 100644
> --- a/lnet/klnds/o2iblnd/o2iblnd.c
> +++ b/lnet/klnds/o2iblnd/o2iblnd.c
> @@ -736,6 +736,7 @@ kiblnd_create_conn(kib_peer_t *peer, struct rdma_cm_id *cmid,
>       int                     cpt;
>       int                     rc;
>       int                     i;
> +     int                     orig_wr;
>
>       LASSERT(net != NULL);
>       LASSERT(!in_interrupt());
> @@ -862,13 +863,23 @@ kiblnd_create_conn(kib_peer_t *peer, struct rdma_cm_id *cmid,
>
>       conn->ibc_sched = sched;
>
> -        rc = rdma_create_qp(cmid, conn->ibc_hdev->ibh_pd, init_qp_attr);
> -        if (rc != 0) {
> -                CERROR("Can't create QP: %d, send_wr: %d, recv_wr: %d\n",
> -                       rc, init_qp_attr->cap.max_send_wr,
> -                       init_qp_attr->cap.max_recv_wr);
> -                goto failed_2;
> -        }
> +     orig_wr = init_qp_attr->cap.max_send_wr;
> +     do {
> +             rc = rdma_create_qp(cmid, conn->ibc_hdev->ibh_pd, init_qp_attr);
> +             if (!rc || init_qp_attr->cap.max_send_wr < 16)
> +                     break;
> +
> +             init_qp_attr->cap.max_send_wr /= 2;
> +     } while (rc);
> +     if (rc != 0) {
> +             CERROR("Can't create QP: %d, send_wr: %d, recv_wr: %d\n",
> +                    rc, init_qp_attr->cap.max_send_wr,
> +                    init_qp_attr->cap.max_recv_wr);
> +             goto failed_2;
> +     }
> +     if (orig_wr != init_qp_attr->cap.max_send_wr)
> +             pr_info("original send wr %d, created with %d\n",
> +                     orig_wr, init_qp_attr->cap.max_send_wr);
>
>          LIBCFS_FREE(init_qp_attr, sizeof(*init_qp_attr));
>
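
For reference, the retry loop in the patch and the concern raised in the
replies can be sketched in user space as follows. This is only an
illustration under stated assumptions: fake_create_qp() is a hypothetical
stand-in for rdma_create_qp() that fails whenever the requested depth exceeds
an assumed device limit, and depth_is_sufficient() is a hypothetical helper
that compares the granted depth against the (credits x fragments) bound.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for rdma_create_qp(): succeeds only when the
 * requested send queue depth fits within an assumed device limit. */
static int fake_create_qp(unsigned int max_send_wr, unsigned int device_limit)
{
        return max_send_wr <= device_limit ? 0 : -ENOMEM;
}

/* Same control flow as the patch: halve the requested depth after each
 * failure and give up once a request below 16 has also failed.
 * Returns 0 and stores the granted depth, or a negative errno. */
static int create_qp_with_backoff(unsigned int requested,
                                  unsigned int device_limit,
                                  unsigned int *granted)
{
        unsigned int wr = requested;
        int rc;

        assert(granted != NULL);
        do {
                rc = fake_create_qp(wr, device_limit);
                if (!rc || wr < 16)
                        break;
                wr /= 2;
        } while (rc);

        if (rc == 0)
                *granted = wr;
        return rc;
}

/* Non-zero when the granted depth still covers the worst-case number of
 * outstanding work requests (credits x fragments per transfer). */
static int depth_is_sufficient(unsigned int granted, unsigned int credits,
                               unsigned int frags)
{
        return granted >= credits * frags;
}
```

For example, a request for 2048 against an assumed limit of 600 is granted
512. With 8 credits and 256 fragments the worst case is 2048, so a sender
that still assumes the original depth could overrun the queue, which is the
situation in which ib_post_send() may fail.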

_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss