RE: [PATCH v1 for-next 02/16] RDMA/ocrdma: Query and initialize the PFC SL

2014-06-12 Thread Selvin Xavier


> -Original Message-
> From: Or Gerlitz [mailto:ogerl...@mellanox.com]
> Sent: Wednesday, June 11, 2014 3:11 PM
> To: Selvin Xavier; linux-rdma@vger.kernel.org; Subramanian Seetharaman
> Cc: rol...@kernel.org; Devesh Sharma; Sathya Perla; Ajit Khaparde
> Subject: Re: [PATCH v1 for-next 02/16] RDMA/ocrdma: Query and initialize
> the PFC SL
> 
> On 10/06/2014 17:02, Selvin Xavier wrote:
> > This patch implements a routine to query the PFC priority from the
> > adapter port.
> >
> > Following are the changes implemented:
> >
> >   * A new FW command is implemented to query the operational/admin
> >     DCBX configuration from the FW and obtain the active priority
> >     (service level).
> >   * Adds support for the async event reported by FW when the PFC
> >     priority changes.
> 
> +benet maintainers,
> 
> Any reason for all the code relating to the above not to land in your
> Ethernet driver? The same FW serves both plain Ethernet and RoCE,
> doesn't it?
> 
Yes, the same FW serves both Ethernet and RoCE.

This patch does not provide a mechanism for users to get/set the PFC
priority for RoCE at this point. The FW performs the DCBX negotiation and
accepts the settings provided by the switch, so the RoCE driver only needs
to query the PFC settings from the FW and apply them when setting up QPs.
That is what this patch implements. Support for reporting the DCBX
parameters to the kernel and implementing the DCBX hooks may be added to
the benet driver in the future.
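
To summarize the update path in a standalone form: the async event handler
only flags that the priority may have changed, and the next create_ah or
modify_qp call re-queries the service level. Below is a minimal userspace
sketch of that pattern using C11 atomics and invented names; it is not the
kernel code itself (the real driver uses atomic_t/atomic_cmpxchg and
queries the FW in ocrdma_init_service_level(), as the hunks quoted below
show).

/*
 * Illustration only: lazy service-level refresh driven by an event flag.
 * Names and the stub FW query are invented for this sketch.
 */
#include <stdatomic.h>
#include <stdio.h>

static atomic_int update_sl = 1;	/* force a query on first use */
static int cached_sl;

static int query_sl_from_fw(void)
{
	return 3;			/* stand-in for the FW mailbox query */
}

/* async event handler: just note that the SL may have changed */
static void on_cos_change_event(void)
{
	atomic_store(&update_sl, 1);
}

/* called on the create_ah/modify_qp paths before the SL is used */
static int current_sl(void)
{
	int expected = 1;

	/* only one caller wins the race and re-queries the FW */
	if (atomic_compare_exchange_strong(&update_sl, &expected, 0))
		cached_sl = query_sl_from_fw();
	return cached_sl;
}

int main(void)
{
	printf("SL = %d\n", current_sl());
	on_cos_change_event();		/* PFC priority changed */
	printf("SL = %d\n", current_sl());
	return 0;
}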

> 
> >     Service level is re-initialized during modify_qp or create_ah,
> >     based on this event.
> >   * Maintain the SL value in the ocrdma_dev structure and refer to it
> >     as and when needed.
> >
> > Signed-off-by: Devesh Sharma 
> > Signed-off-by: Selvin Xavier 
> > ---
> >   drivers/infiniband/hw/ocrdma/ocrdma.h  |   21 
> >   drivers/infiniband/hw/ocrdma/ocrdma_ah.c   |2 +
> >   drivers/infiniband/hw/ocrdma/ocrdma_hw.c   |  172
> >   drivers/infiniband/hw/ocrdma/ocrdma_hw.h   |2 +
> >   drivers/infiniband/hw/ocrdma/ocrdma_main.c |1 +
> >   drivers/infiniband/hw/ocrdma/ocrdma_sli.h  |   81 +-
> >   6 files changed, 278 insertions(+), 1 deletions(-)
> >
> > diff --git a/drivers/infiniband/hw/ocrdma/ocrdma.h
> > b/drivers/infiniband/hw/ocrdma/ocrdma.h
> > index 19011db..5cd65c2 100644
> > --- a/drivers/infiniband/hw/ocrdma/ocrdma.h
> > +++ b/drivers/infiniband/hw/ocrdma/ocrdma.h
> > @@ -236,6 +236,9 @@ struct ocrdma_dev {
> > struct rcu_head rcu;
> > int id;
> > u64 stag_arr[OCRDMA_MAX_STAG];
> > +   u8 sl; /* service level */
> > +   bool pfc_state;
> > +   atomic_t update_sl;
> > u16 pvid;
> > u32 asic_id;
> >
> > @@ -518,4 +521,22 @@ static inline u8 ocrdma_get_asic_type(struct ocrdma_dev *dev)
> > OCRDMA_SLI_ASIC_GEN_NUM_SHIFT;
> >   }
> >
> > +static inline u8 ocrdma_get_pfc_prio(u8 *pfc, u8 prio)
> > +{
> > +   return *(pfc + prio);
> > +}
> > +
> > +static inline u8 ocrdma_get_app_prio(u8 *app_prio, u8 prio)
> > +{
> > +   return *(app_prio + prio);
> > +}
> > +
> > +static inline u8 ocrdma_is_enabled_and_synced(u32 state)
> > +{  /* May also be used to interpret TC-state, QCN-state
> > +     * Appl-state and Logical-link-state in future.
> > +     */
> > +   return (state & OCRDMA_STATE_FLAG_ENABLED) &&
> > +          (state & OCRDMA_STATE_FLAG_SYNC);
> > +}
> > +
> >   #endif
> > diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_ah.c
> > b/drivers/infiniband/hw/ocrdma/ocrdma_ah.c
> > index d4cc01f..a023234 100644
> > --- a/drivers/infiniband/hw/ocrdma/ocrdma_ah.c
> > +++ b/drivers/infiniband/hw/ocrdma/ocrdma_ah.c
> > @@ -100,6 +100,8 @@ struct ib_ah *ocrdma_create_ah(struct ib_pd *ibpd, struct ib_ah_attr *attr)
> > if (!(attr->ah_flags & IB_AH_GRH))
> > return ERR_PTR(-EINVAL);
> >
> > +   if (atomic_cmpxchg(&dev->update_sl, 1, 0))
> > +   ocrdma_init_service_level(dev);
> > ah = kzalloc(sizeof(*ah), GFP_ATOMIC);
> > if (!ah)
> > return ERR_PTR(-ENOMEM);
> > diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_hw.c
> > b/drivers/infiniband/hw/ocrdma/ocrdma_hw.c
> > index bce4adf..e6463cb 100644
> > --- a/drivers/infiniband/hw/ocrdma/ocrdma_hw.c
> > +++ b/drivers/infiniband/hw/ocrdma/ocrdma_hw.c
> > @@ -771,6 +771,10 @@ static void ocrdma_process_grp5_aync(struct ocrdma_dev *dev,
> >                          OCRDMA_AE_PVID_MCQE_TAG_MASK) >>
> >                          OCRDMA_AE_PVID_MCQE_TAG_SHIFT);
> > break;
> > +
> > +   case OCRDMA_ASYNC_EVENT_COS_VALUE:
> > +   atomic_set(&dev->update_sl, 1);
> > +   break;
> > default:
> > /* Not interested evts. */
> > break;
> > @@ -2265,6 +2269,8 @@ static int ocrdma_set_av_params(struct ocrdma_qp *qp,
> >
> > if ((ah_attr->ah_flags & IB_AH_GRH) == 0)
> > return -EINVAL;
> > +   if (atomic_cmpxchg(&qp->dev->update_sl, 1, 0))
> > +   ocrdma_init_service_level(qp->dev);
> > cmd->params.tclass_sq_psn |=
> > (ah_attr->grh.traffic_class << OCRDMA_QP_PARAMS_TC

Re: nfs-rdma performance

2014-06-12 Thread Mark Lehrer
I am using ConnectX-3 HCAs and Dell R720 servers.

On Thu, Jun 12, 2014 at 2:00 PM, Steve Wise  wrote:
> On 6/12/2014 2:54 PM, Mark Lehrer wrote:
>>
>> Awesome work on nfs-rdma in the later kernels!  I had been having
>> panic problems for awhile and now things appear to be quite reliable.
>>
>> Now that things are more reliable, I would like to help work on speed
>> issues.  On this same hardware with SMB Direct and the standard
>> storage review 8k 70/30 test, I get combined read & write performance
>> of around 2.5GB/sec.  With nfs-rdma it is pushing about 850MB/sec.
>> This is simply an unacceptable difference.
>>
>> I'm using the standard settings -- connected mode, 65520 byte MTU,
>> nfs-server-side "async", lots of nfsd's, and nfsver=3 with large
>> buffers.  Does anyone have any tuning suggestions and/or places to
>> start looking for bottlenecks?
>
>
> What RDMA device?
>
> Steve.


Re: nfs-rdma performance

2014-06-12 Thread Wendy Cheng
On Thu, Jun 12, 2014 at 12:54 PM, Mark Lehrer  wrote:
>
> Awesome work on nfs-rdma in the later kernels!  I had been having
> panic problems for awhile and now things appear to be quite reliable.
>
> Now that things are more reliable, I would like to help work on speed
> issues.  On this same hardware with SMB Direct and the standard
> storage review 8k 70/30 test, I get combined read & write performance
> of around 2.5GB/sec.  With nfs-rdma it is pushing about 850MB/sec.
> This is simply an unacceptable difference.
>
> I'm using the standard settings -- connected mode, 65520 byte MTU,
> nfs-server-side "async", lots of nfsd's, and nfsver=3 with large
> buffers.  Does anyone have any tuning suggestions and/or places to
> start looking for bottlenecks?
>

There is a tunable called "xprt_rdma_slot_table_entries"; increasing it
seemed to help a lot for me last year. Be aware that this tunable is
enclosed inside "#ifdef RPC_DEBUG", so you might need to tweak the
source and rebuild the kmod.
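
If the sysctl is available, it can be read and set from userspace. Below
is a minimal sketch; the path /proc/sys/sunrpc/rdma_slot_table_entries is
an assumption about where the tunable is exposed when RPC_DEBUG is
enabled, so verify it on your kernel first.

/*
 * Minimal sketch: read and optionally set the RPC/RDMA slot table size.
 * The sysctl path below is an assumption (only present when xprtrdma was
 * built with RPC_DEBUG); adjust it if your kernel exposes it elsewhere.
 */
#include <stdio.h>

int main(int argc, char **argv)
{
	const char *path = "/proc/sys/sunrpc/rdma_slot_table_entries";
	FILE *f = fopen(path, "r+");
	int cur;

	if (!f) {
		perror(path);		/* likely built without RPC_DEBUG */
		return 1;
	}
	if (fscanf(f, "%d", &cur) == 1)
		printf("current slot table entries: %d\n", cur);
	if (argc > 1) {			/* e.g. ./rdma_slots 64 */
		rewind(f);
		fprintf(f, "%s\n", argv[1]);
	}
	fclose(f);
	return 0;
}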


-- Wendy


Re: nfs-rdma performance

2014-06-12 Thread Steve Wise

On 6/12/2014 2:54 PM, Mark Lehrer wrote:

Awesome work on nfs-rdma in the later kernels!  I had been having
panic problems for awhile and now things appear to be quite reliable.

Now that things are more reliable, I would like to help work on speed
issues.  On this same hardware with SMB Direct and the standard
storage review 8k 70/30 test, I get combined read & write performance
of around 2.5GB/sec.  With nfs-rdma it is pushing about 850MB/sec.
This is simply an unacceptable difference.

I'm using the standard settings -- connected mode, 65520 byte MTU,
nfs-server-side "async", lots of nfsd's, and nfsver=3 with large
buffers.  Does anyone have any tuning suggestions and/or places to
start looking for bottlenecks?


What RDMA device?

Steve.


nfs-rdma performance

2014-06-12 Thread Mark Lehrer
Awesome work on nfs-rdma in the later kernels!  I had been having
panic problems for awhile and now things appear to be quite reliable.

Now that things are more reliable, I would like to help work on speed
issues.  On this same hardware with SMB Direct and the standard
storage review 8k 70/30 test, I get combined read & write performance
of around 2.5GB/sec.  With nfs-rdma it is pushing about 850MB/sec.
This is simply an unacceptable difference.

I'm using the standard settings -- connected mode, 65520 byte MTU,
nfs-server-side "async", lots of nfsd's, and nfsver=3 with large
buffers.  Does anyone have any tuning suggestions and/or places to
start looking for bottlenecks?

Thanks,
Mark


Comparison of RDMA read and write performance

2014-06-12 Thread Anuj Kalia
Hi.

While working with RDMA, I have consistently observed that RDMA write
performance is significantly higher than RDMA read performance.

My experimental setup is the following. I use several "client"
machines to issue RDMA operations to one "server" machine, and I
record the total operations/second across the client machines. My goal
is to calculate the RDMA rate supported by the server machine's RDMA
adapter. I use small-sized transfers (around 64 bytes) and reliable
transport (RC).

Here are the approximate numbers (million operations per second) that
I get for two cards.

ConnectX-3, 353A with PCIe 2.0 x8: 20 Mops READ, 26 Mops WRITE
ConnectX-3, 313A (RoCE) with PCIe 2.0 x8: 20 Mops READ, 26 Mops WRITE
ConnectX-3, 354A with PCIe 3.0 x8: 23 Mops READ, 35 Mops WRITE

The WRITE message rate for the PCIe 3.0 system matches the advertised
35 million messages/second rate for ConnectX-3 cards, but the READ
throughput is significantly lower.

I have some ad hoc explanations for this observation:

1. 
http://pdf.aminer.org/000/344/730/pvfs_over_infiniband_design_and_performance_evaluation.pdf:
this is a decade-old paper, but its Figure 5 shows that writes
outperform reads.

2. PCIe write operations (posted) are cheaper than PCIe read
operations (non-posted), so this might help RDMA writes.

3. The number of outstanding RDMA reads on any queue pair is small (at
most 16 for CX-3). This limits RDMA read performance; see the sketch
after this list for where that limit is queried and applied.
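
As a concrete reference for point 3, here is a small sketch (not the
benchmark code used above) that queries the per-QP limits a device
advertises for outstanding RDMA reads/atomics. The queried values are
what you would feed into ibv_modify_qp(): attr.max_dest_rd_atomic with
IBV_QP_MAX_DEST_RD_ATOMIC at the RTR transition, and attr.max_rd_atomic
with IBV_QP_MAX_QP_RD_ATOMIC at RTS, so that as many reads as the HCA
allows are kept in flight.

/*
 * Sketch: print the per-QP outstanding RDMA read/atomic limits of the
 * first RDMA device found.  Link with -libverbs.
 */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
	int num;
	struct ibv_device **list = ibv_get_device_list(&num);
	struct ibv_context *ctx;
	struct ibv_device_attr attr;

	if (!list || num == 0) {
		fprintf(stderr, "no RDMA devices found\n");
		return 1;
	}
	ctx = ibv_open_device(list[0]);
	if (!ctx || ibv_query_device(ctx, &attr)) {
		fprintf(stderr, "device query failed\n");
		return 1;
	}
	/* max reads a QP may have in flight as initiator / as target */
	printf("max_qp_init_rd_atom = %d\n", attr.max_qp_init_rd_atom);
	printf("max_qp_rd_atom      = %d\n", attr.max_qp_rd_atom);

	ibv_close_device(ctx);
	ibv_free_device_list(list);
	return 0;
}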

However, none of these is completely convincing. I wonder if someone
else in the community has observed this or has a better explanation.

I have tried improving my RDMA-read code in many ways, and my numbers
match the ones in FaRM (
https://www.usenix.org/system/files/conference/nsdi14/nsdi14-paper-dragojevic.pdf),
but I am still interested to hear whether someone has observed higher
RDMA read performance with these cards.

The advertised performance of Connect-IB is 137 million RDMA writes
per second. Is there an advertised number for RDMA reads per second?

Thanks and regards,
Anuj Kalia


Re: [PATCH libibverbs V4 3/5] Use neighbour lookup for RoCE UD QPs Eth L2 resolution

2014-06-12 Thread Jason Gunthorpe
On Thu, Jun 12, 2014 at 03:44:03PM +0300, Matan Barak wrote:

> We could use a libibverbs API call in order to resolve the IP based
> GID into a MAC, but I think it could cause multiple vendors to have
> some code duplication.

If that is the only objection, I would prefer to see this technique. A
little provider code duplication is a lesser evil than introducing and
vetting new verbs APIs.

I think the patch will be very small and there will be very little to
talk about from an API perspective.

> We hope that in the future, more products will use RoCE with IP
> based GIDs. All those providers will have to supply similar code
> that checks if the link layer is Ethernet and an IP based GID is used,
> and then they'll have to use the libibverbs utility function.

AFAIK all the other RoCEE implementations don't do InfiniBand link
layer, so their providers don't even need the test.

Jason


Re: [PATCH libibverbs V4 3/5] Use neighbour lookup for RoCE UD QPs Eth L2 resolution

2014-06-12 Thread Matan Barak

On 21/5/2014 11:31 PM, Jason Gunthorpe wrote:

On Sun, May 18, 2014 at 12:38:47PM +0300, Or Gerlitz wrote:

  struct ibv_ah *__ibv_create_ah(struct ibv_pd *pd, struct ibv_ah_attr *attr)
  {
-   struct ibv_ah *ah = pd->context->ops.create_ah(pd, attr);
+   int err;
+   struct ibv_ah *ah = NULL;
+#ifndef NRESOLVE_NEIGH
+   struct ibv_port_attr port_attr;
+   int dst_family;
+   int src_family;
+   int oif;
+   struct get_neigh_handler neigh_handler;
+   union ibv_gid sgid;
+   struct ibv_ah_attr_ex attr_ex;
+   int ether_len;
+   struct verbs_context *vctx = verbs_get_ctx_op(pd->context,
+ create_ah_ex);
+   struct peer_address src;
+   struct peer_address dst;
+
+   if (!vctx) {
+#endif
+   ah = pd->context->ops.create_ah(pd, attr);
+#ifndef NRESOLVE_NEIGH
+   goto return_ah;
+   }
+
+   err = ibv_query_port(pd->context, attr->port_num, &port_attr);


It feels like a regression to force this overhead. Many HCAs can only
support IB, or only Ethernet, and don't need this.

This whole arrangement seems strange. create_ah_ex should be a
full-fledged, user-callable function, not a buried driver entry point.
It is also very unusual for verbs to have all this generic code in
a driver wrapper.

I suspect the answer here is to have the driver call into helper
functions from verbs to do this addressing
work. 'get_ethernet_l2_from_ah' or something.

If you do that then we don't even need create_ah_ex and query_port_ex
- those functions seem to be required only to support this wrapper
technique.

So please rethink how this flows.. Maybe wrapping is not the best
choice??

Jason



We could use a libibverbs API call in order to resolve the IP based GID
into a MAC, but I think it could cause multiple vendors to have some
code duplication. We hope that in the future, more products will use
RoCE with IP based GIDs. All those providers will have to supply similar
code that checks if the link layer is Ethernet and an IP based GID is
used, and then they'll have to use the libibverbs utility function.


A possible future *libibverbs* create_ah_ex could be a full-fledged verb
that uses the new provider create_ah_ex. Currently it's hidden inside
the good old create_ah call, but we shouldn't limit ourselves to keeping
it that way.


Anyway, I have no objection to using a utility function instead of this
method, but I do think that the current code has some advantages.


Matan
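
For reference, here is a minimal sketch of the kind of provider-callable
helper Jason suggests above. The name get_ethernet_l2_from_ah comes from
his message; the signature and body are illustrative assumptions only,
not an existing libibverbs API, and the neighbour resolution itself is
left as a stub.

/*
 * Hypothetical sketch only: a verbs-side helper along the lines of
 * "get_ethernet_l2_from_ah".  Neither the name nor the signature exists
 * in libibverbs; the neighbour lookup is deliberately omitted.
 */
#include <string.h>
#include <errno.h>
#include <stdint.h>
#include <infiniband/verbs.h>

int get_ethernet_l2_from_ah(struct ibv_pd *pd, struct ibv_ah_attr *attr,
			    uint8_t dmac[6], uint16_t *vlan_id)
{
	struct ibv_port_attr port_attr;
	int ret;

	ret = ibv_query_port(pd->context, attr->port_num, &port_attr);
	if (ret)
		return -ret;

	/* nothing to resolve on an InfiniBand link */
	if (port_attr.link_layer != IBV_LINK_LAYER_ETHERNET)
		return -ENOTSUP;

	/*
	 * For RoCE with IP based GIDs: map attr->grh.dgid to a next-hop
	 * MAC, e.g. by consulting the kernel neighbour table over
	 * rtnetlink.  This is the shared logic the thread is debating;
	 * it is omitted here.
	 */
	memset(dmac, 0, 6);
	*vlan_id = 0xffff;	/* "no VLAN" sentinel used by this sketch */
	return -ENOSYS;
}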