Re: [PATCH] rdma cm + XRC

2010-08-18 Thread Roland Dreier
   I feel similarly about the XRC domain.  Is there any real reason to
   expose it?  What if we just defined a 1:1 relationship between PDs
   and XRC domains, or between XRC domains and XRC TGT QPs?

  Near as I can tell it serves the same purpose as the PD, to provide
  a form of security within a single process..

No, XRC domains can be shared between different processes -- that's kind
of the point of XRCs.

 - R.
-- 
Roland Dreier rola...@cisco.com || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH] rdma cm + XRC

2010-08-11 Thread Hefty, Sean
 It seems the new API has too many constraints for XRC. There are a couple
 things that don't fit:
 
 - XRC needs a domain, which must be created before creating the QP, but
   after we know the device to use. In addition it also needs a file
   descriptor. The application may want to use a different fd depending
   on the device. Currently the domain can only be created in the middle
   of rdma_create_ep().

This looks like a gap in the APIs.  There's no easy way to associate the data 
returned by rdma_addrinfo to a specific ibv_device.  Part of the issue is that 
rdma_addrinfo may not have an ai_src_addr.  gurgle...

I agree with Jason that we can still change the newer calls.  In this case, the 
problem isn't limited to XRC.  The user will have issues just trying to specify 
the CQs that should be associated with the QP.  Maybe the 'fix' here is to 
remove rdma_create_qp() from rdma_create_ep() -- which basically replaces that 
API with rdma_create_id2(**id, *res).

 - The server side of the connection also needs an SRQ. It's not obvious
   whether it's the application or rdma cm to create that SRQ. And that
   SRQ number must be given to the client side, presumably in the private
   data.

The desired mapping of XRC to the librdmacm isn't clear to me.  For example, 
after 'connecting' is two-way communication possible (setting up INI/TGT pairs 
on both nodes), or is a connection only one-way (setup local INI to remote 
TGT)?  Also, as you point out, how are SRQ values exchanged?  Does private data 
carry one SRQ value, all SRQ values for remote processes, none?

- Sean


Re: [PATCH] rdma cm + XRC

2010-08-11 Thread frank zago
On 08/11/2010 05:22 PM, Hefty, Sean wrote:
 - The server side of the connection also needs an SRQ. It's not 
 obvious whether it's the application or rdma cm to create that SRQ.
 And that SRQ number must be given to the client side, presumably in
 the private data.
 
 The desired mapping of XRC to the librdmacm isn't clear to me.  For 
 example, after 'connecting' is two-way communication possible 
 (setting up INI/TGT pairs on both nodes), or is a connection only 
 one-way (setup local INI to remote TGT)?  Also, as you point out, how
 are SRQ values exchanged?  Does private data carry one SRQ value, all
 SRQ values for remote processes, none?

XRC is one way. There is a send QP (which looks like a regular RC QP),
and a receive QP. Logically, the listen side would create an XRC receive QP,
and for each incoming connection would create/attach an SRQ to it. But the
initiator would have to know the SRQ number to be able to send data, and
the easiest way to pass that information is in the CM REP private data.

I think one way to see it is that you connect a QP on one side to an SRQ
on the other side. In the case of XRC, rdma_get_request() could create that SRQ
instead of a QP.

Frank.


Re: [PATCH] rdma cm + XRC

2010-08-11 Thread Jason Gunthorpe
On Wed, Aug 11, 2010 at 03:22:45PM -0700, Hefty, Sean wrote:
  It seems the new API has too many constraints for XRC. There are a couple
  things that don't fit:
  
  - XRC needs a domain, which must be created before creating the QP, but
    after we know the device to use. In addition it also needs a file
    descriptor. The application may want to use a different fd depending
    on the device. Currently the domain can only be created in the middle
    of rdma_create_ep().
 
 This looks like a gap in the APIs.  There's no easy way to associate
 the data returned by rdma_addrinfo to a specific ibv_device.  Part of
 the issue is that rdma_addrinfo may not have an ai_src_addr.
 gurgle...

This is why I liked the notion of passing in the pd. This restricts
getaddrinfo to doing something that is compatible with the PD and when
the rdma_cm_id is created and bound it is bound to a device, selected
by getaddrinfo, or the kernel, that is compatible with the given PD.

[** I looked at this for a bit, and I couldn't convince myself the
 current implementation doesn't have this gap either. The rdma_cm_id
 is bound to a device based on IP addresses, but it can be bound
 without specifying a PD - so there really is no guarantee that the PD
 you want to use will be compatible with the device the kernel
 selects - I bet this means most RDMA CM using apps will explode if you
 do something like an IPoIB bond across two HCAs..]

[The other view is that exporting per device domains to userspace
 means the kernel has walked away from its role as HW resource
 virtualizer. Why can't a PD be global and the kernel swap it into
 HW as necessary? Makes much of this API mess instantly disappear.]

Ditto for XRC domains.

I think the flow works best for apps; generally apps are being written
that can handle only one domain - so they should get the domain
through an initial getaddrinfo call with 0 (NULL) hints and then reuse
that domain in all future calls for secondary connections.

 I agree with Jason that we can still change the newer calls.  In
 this case, the problem isn't limited to XRC.  The user will have
 issues just trying to specify the CQs that should be associated with
 the QP.  Maybe the 'fix' here is to remove rdma_create_qp() from
 rdma_create_ep() -- which basically replaces that API with
 rdma_create_id2(**id, *res).

Maybe 3 functions, since you already have create_ep:
create_id_ep - takes rdma_addrinfo, allocates PD/XRC, rdma_cm_id
create_qp_ep - takes rdma_addrinfo, allocates QP, CQ, etc
create_ep - just calls both the above. Very simplified
(not sure on the names)

Flow is then:

// First QP
hints = 0;
rdma_getaddrinfo(..,hints,res);
rdma_create_id_ep(id,res)
// id->verbs, id->pd, id->xrcdomain are valid now
rdma_create_qp_ep(id,res,attrs);

// Second QP
hints.pd = first_id->pd;
hints.xrcdomain = first_id->xrcdomain;
rdma_getaddrinfo(...,hints,res);
// res->pd/xrcdomain are == first_id's
// No pd is allocated
rdma_create_ep(second_id,res,attrs);

How do you keep track of the lifetime of the pd though?

This also cleans up the confusing half-state of the rdma_cm_id with
the legacy API where id->verbs can be 0.

  - The server side of the connection also needs an SRQ. It's not obvious
    whether it's the application or rdma cm to create that SRQ. And that
    SRQ number must be given to the client side, presumably in the private
    data.
 
 The desired mapping of XRC to the librdmacm isn't clear to me.  For
 example, after 'connecting' is two-way communication possible
 (setting up INI/TGT pairs on both nodes), or is a connection only
 one-way (setup local INI to remote TGT)?  Also, as you point out,
 how are SRQ values exchanged?  Does private data carry one SRQ
 value, all SRQ values for remote processes, none?

Well, I think RDMACM should do the minimum above what is defined for
the CM protocol, so for XRC that is a unidirectional connect and it
only creates INI/TGT pairs. The required SRQ(s) will have to be setup
by the user - I expect the typical use would be SRQs shared by
multiple TGT QPs.

It looks to me like the main use model for this is peer-peer, so each
side would establish their send half independently and message routing
would be app specific. This means the CM initiator side should be the
side that has the INI QP and the CM target side should be the side
with TGT - ?

Absent any standards, private data SRQ number exchange is protocol
specific..

Jason


RE: [PATCH] rdma cm + XRC

2010-08-11 Thread Hefty, Sean
 Maybe 3 functions, since you already have create_ep:
 create_id_ep - takes rdma_addrinfo, allocates PD/XRC, rdma_cm_id
 create_qp_ep - takes rdma_addrinfo, allocates QP, CQ, etc
 create_ep - just calls both the above. Very simplified
 (not sure on the names)

This is similar to what I was thinking, except I would just use the existing 
create_qp.

I need to give adding PDs to rdma_addrinfo more thought.  It seems that if 
AF_IB were ever accepted, then device specific addressing could be used, rather 
than relying on mapped values.  As an alternative, we could define a function 
like:

struct ibv_context *rdma_get_device(rdma_addrinfo *res);

Internally, this would just end up doing a lot of the same work that 
create_id_ep mentioned above would do.

 How do you keep track of the lifetime of the pd though?

The librdmacm obtains the list of devices during initialization.  It allocates 
a PD per device, which exist while the library is loaded.  This can be 
optimized to release the PDs when nothing is left that references the devices, 
but that's not done now.  If the user specifies the PD, it's up to them to 
track it.

 Well, I think RDMACM should do the minimum above what is defined for
 the CM protocol, so for XRC that is a unidirectional connect and it
 only creates INI/TGT pairs. The required SRQ(s) will have to be setup
 by the user - I expect the typical use would be SRQs shared by
 multiple TGT QPs.
 
 It looks to me like the main use model for this is peer-peer, so each
 side would establish their send half independently and message routing
 would be app specific. This means the CM initiator side should be the
 side that has the INI QP and the CM target side should be the side
 with TGT - ?

This is why I questioned what the desired behavior should be (from an API 
perspective).  If the main usage model is peer-peer, then the librdmacm _could_ 
allocate and connect XRC INI and TGT QPs as pairs, so that bidirectional 
traffic was possible.  (For example, the rdma_cm could respond to a CM REQ with 
a CM REP and a CM REQ for the return path.)

I'm not saying this wouldn't end up in an implementation mess.  The easiest 
thing to do is just perform a unidirectional connect and leave the SRQs up to 
the user.  XRC just seems hideous to use from an application programmer 
viewpoint, but it seems worth exploring if an app could make use of it without 
significant changes from what they would do for RC QPs.

- Sean


Re: [PATCH] rdma cm + XRC

2010-08-11 Thread Jason Gunthorpe
On Wed, Aug 11, 2010 at 05:04:00PM -0700, Hefty, Sean wrote:
  Maybe 3 functions, since you already have create_ep:
  create_id_ep - takes rdma_addrinfo, allocates PD/XRC, rdma_cm_id
  create_qp_ep - takes rdma_addrinfo, allocates QP, CQ, etc
  create_ep - just calls both the above. Very simplified
  (not sure on the names)
 
 This is similar to what I was thinking, except I would just use the
 existing create_qp.
 
 I need to give adding PDs to rdma_addrinfo more thought.  It seems
 that if AF_IB were ever accepted, then device specific addressing
 could be used, rather than relying on mapped values.  As an
 alternative, we could define a function like:

Even so, the point of passing it into rdma_getaddrinfo is to restrict
the device rdma_getaddrinfo selects, it doesn't matter that you can get
back to the PD from the addrinfo if that PD doesn't have the resources
you need attached to it. Again, I'm thinking from the app perspective
where juggling multiple PDs isn't really done, and thus multiple
connections on different HCAs are not supported by the app. This model
should be supportable without introducing random failures when
rdma_getaddrinfo returns things that use other devices.

 struct ibv_context *rdma_get_device(rdma_addrinfo *res);
 
 Internally, this would just end up doing a lot of the same work that
 create_id_ep mentioned above would do.

Indeed - so why bother? create_id_ep gets you the ID, bound to a
device with a verbs handle. Follow-up calls can use the convention that
0 for the PD means 'use the global default PD'; otherwise an app can
allocate a new PD, or find an existing one using the verbs handle
provided.

  It looks to me like the main use model for this is peer-peer, so each
  side would establish their send half independently and message routing
  would be app specific. This means the CM initiator side should be the
  side that has the INI QP and the CM target side should be the side
  with TGT - ?
 
 This is why I questioned what the desired behavior should be (from
 an API perspective).  If the main usage model is peer-peer, then the
 librdmacm _could_ allocate and connect XRC INI and TGT QPs as pairs,
 so that bidirectional traffic was possible.  (For example, the
 rdma_cm could respond to a CM REQ with a CM REP and a CM REQ for the
 return path.)

If the model is peer-peer having half duplex connections is ideal -
peer-peer would mean that either side could start setting things up at
any time, dealing with the inherent races is a huge messy
problem. Having each side control when its half of the connection
starts up cleans things up tremendously.

 I'm not saying this wouldn't end up in an implementation mess.  The
 easiest thing to do is just perform a unidirectional connect and
 leave the SRQs up to the user.  XRC just seems hideous to use from
 an application programmer viewpoint, but it seems worth exploring if
 an app could make use of it without significant changes from what
 they would do for RC QPs.

Well.. I admit I can't think of many uses for XRC - but I can think of
two reasonable approaches for non peer-peer apps:
 - Create a RC QP and build the XRC QPs by exchanging messages within
   the RC QP - avoid the CM protocol entirely. XRC would then be a
   secondary channel to the RC channel, probably for buffer size sorting
   purposes
 - Create the TGT QP on the initiator side and pass its info in the
   private message and do a double QP setup. More complex - does
   rdmacm provide hooks to do this?

Jason


RE: [PATCH] rdma cm + XRC

2010-08-11 Thread Hefty, Sean
 Even so, the point of passing it into rdma_getaddrinfo is to restrict
 the device rdma_getaddrinfo selects, it doesn't matter that you can get
 back to the PD from the addrinfo if that PD doesn't have the resources
 you need attached to it. Again, I'm thinking from the app perspective
 where juggling multiple PDs isn't really done, and thus multiple
 connections on different HCAs are not supported by the app. This model
 should be supportable without introducing random failures when
 rdma_getaddrinfo returns things that use other devices.

But why select the PD as the restriction, rather than just the device?

If rdma_getaddrinfo calls into a service to obtain some of its information, 
then at some point an address must be sufficient.  What about just passing in a 
device guid as the ai_src_addr?

or, I guess we could add the device to rdma_getaddrinfo:

rdma_getaddrinfo(struct ibv_context *verbs, char *node, char *service,
struct rdma_addrinfo *hints, struct rdma_addrinfo **res);

  - Create the TGT QP on the initiator side and pass its info in the
private message and do a double QP setup. More complex - does
rdmacm provide hooks to do this?

I thought about this as well.  The librdmacm could steal more of the private 
data for this - just giving less to the user when connecting using XRC.  Right 
now, the rdma_cm doesn't provide anything to help connect XRC QPs, so we can do 
what makes the most sense.  I still don't know how the apps exchange SRQs, 
which is really part of the entire connection process...

- Sean


Re: [PATCH] rdma cm + XRC

2010-08-11 Thread Jason Gunthorpe
On Wed, Aug 11, 2010 at 08:30:53PM -0700, Hefty, Sean wrote:
  Even so, the point of passing it into rdma_getaddrinfo is to restrict
  the device rdma_getaddrinfo selects, it doesn't matter that you can get
  back to the PD from the addrinfo if that PD doesn't have the resources
  you need attached to it. Again, I'm thinking from the app perspective
  where juggling multiple PDs isn't really done, and thus multiple
  connections on different HCAs are not supported by the app. This model
  should be supportable without introducing random failures when
  rdma_getaddrinfo returns things that use other devices.
 
 But why select the PD as the restriction, rather than just the device?

Well.. That is ok too, but I feel the PD is more general, for instance
with soft iwarp/iboe there is no reason the PD is tied to a single
device - devices are tied to netdevs, but the PD can be shared between
them all.

There are some missing APIs here and there that muddle this (ie there
is no API to set the device a QP is associated with), but I'd rather
not see more added :)

 If rdma_getaddrinfo calls into a service to obtain some of its
 information, then at some point an address must be sufficient.  What
 about just passing in a device guid as the ai_src_addr?

Again, the PD is more general..

There are lots of things rdma_getaddrinfo can support, mapping to a
guid seems restrictive - not sure how guid mapping works with
iwarp/etc when you consider vlans, for instance.

 or, I guess we could add the device to rdma_getaddrinfo:

I'd rather see it in hints, but it isn't a bad idea to have something
like that, even if it is in addition to the PD.

   - Create the TGT QP on the initiator side and pass its info in the
 private message and do a double QP setup. More complex - does
 rdmacm provide hooks to do this?
 
 I thought about this as well.  The librdmacm could steal more of the
 private data for this - just giving less to the user when connecting
 using XRC.  Right now, the rdma_cm doesn't provide anything to help
 connect XRC QPs, so we can do what makes the most sense.  I still
 don't know how the apps exchange SRQs, which is really part of the
 entire connection process...

Well, whatever it is, the whole private data thing has to be optional,
because it is really protocol specific.

I assume the typical model for XRC would be demand-started peer-peer.

So.. app A starts, looks for friends, finds none, goes to sleep. app B
starts, finds A, sets up a channel from B->A and sends 'hello! Here
are my SRQs..'. app A gets the message, builds a reply, looks at the
destination of app B, finds no existing XRC send channel so it sets
one up, sends 'hello! here are my SRQs', and the ack'ing reply to B's
hello, and presto, bi-directional communication, peer-peer
initiation.

I'm assuming a single SRQ is exchanged during the CM process, and the
total number of SRQs exceeds the size of the CM private data - but I
could also imagine packing them all in the private data if an app
wanted. I guess the key thing is that only a single SRQ is necessary
to start, and the target side switches to an initiator role to setup
its side of the connection.

It is actually pretty good if peer-peer is your app design.. If you
tried to do the same with duplex channels you'd end up with races
creating duplicate send sides that you'd want to tear down to save
resources.

Jason


Re: [PATCH] rdma cm + XRC

2010-08-10 Thread frank zago
Hello Sean,

On 08/09/2010 03:53 PM, Hefty, Sean wrote:
 This allow rdma ucm to establish an XRC connection between two nodes. Most
 of the changes are related to modify_qp since the API is different
 whether the QP is on the send or receive side.
 To create an XRC receive QP, the cap.max_send_wr must be set to 0.
 Conversely, to create the send XRC QP, that attribute must be non-zero.
 
 I need to give XRC support to the librdmacm more thought, but here are at 
 least the initial concerns:
 
 - XRC support upstream (kernel and user space) is still pending.
   (I can start a librdmacm branch for XRC support.)
 - Changes are needed to the kernel rdma_cm.
   We could start submitting patches against Roland's xrc branch for these.
 - Please update to the latest librdmacm tree.
   More specifically, rdma_getaddrinfo should support XRC as well.

The general parameters would be the same as for RC. Should we create a new
ai_flag? Or a new port space?
Is it really necessary to support rdma_getaddrinfo, rdma_create_ep and the new
APIs?

 In general, I'd like to find a way to add XRC support to the librdmacm that 
 makes things as simple for the user as possible.

Besides the need to correctly set cap.max_send_wr, the user API is unchanged.

New patch attached.
diff --git a/include/rdma/rdma_cma.h b/include/rdma/rdma_cma.h
index d17ef88..d18685b 100644
--- a/include/rdma/rdma_cma.h
+++ b/include/rdma/rdma_cma.h
@@ -125,6 +125,8 @@ struct rdma_cm_id {
 	struct ibv_cq		*send_cq;
 	struct ibv_comp_channel *recv_cq_channel;
 	struct ibv_cq		*recv_cq;
+	struct ibv_xrc_domain	*xrc_domain;
+	uint32_t		xrc_rcv_qpn;
 };
 
 enum {
diff --git a/man/rdma_create_qp.3 b/man/rdma_create_qp.3
index 9d2de76..659e033 100644
--- a/man/rdma_create_qp.3
+++ b/man/rdma_create_qp.3
@@ -39,6 +39,10 @@ a send or receive completion queue is not specified, then a CQ will be
 allocated by the rdma_cm for the QP, along with corresponding completion
 channels.  Completion channels and CQ data created by the rdma_cm are
 exposed to the user through the rdma_cm_id structure.
+.P
+To create an XRC receive QP, in addition to specifying the XRC QP type,
+ibv_qp_init_attr.cap.max_send_wr must be set to 0. Conversely, to
+create the XRC send QP, that attribute must be non-zero.
 .SH SEE ALSO
 rdma_bind_addr(3), rdma_resolve_addr(3), rdma_destroy_qp(3), ibv_create_qp(3),
 ibv_modify_qp(3)
diff --git a/src/cma.c b/src/cma.c
index a4fd574..b4eec77 100755
--- a/src/cma.c
+++ b/src/cma.c
@@ -948,12 +948,29 @@ static int rdma_init_qp_attr(struct rdma_cm_id *id, struct ibv_qp_attr *qp_attr,
 	return 0;
 }
 
+static int rdma_modify_qp(struct rdma_cm_id *id,
+			  struct ibv_qp_attr *qp_attr,
+			  int qp_attr_mask)
+{
+	int ret;
+
+	if (id->qp)
+		ret = ibv_modify_qp(id->qp, qp_attr, qp_attr_mask);
+	else if (id->xrc_domain)
+		ret = ibv_modify_xrc_rcv_qp(id->xrc_domain, id->xrc_rcv_qpn,
+					    qp_attr, qp_attr_mask);
+	else
+		ret = EINVAL;
+
+	return ret;
+}
+
 static int ucma_modify_qp_rtr(struct rdma_cm_id *id, uint8_t resp_res)
 {
 	struct ibv_qp_attr qp_attr;
 	int qp_attr_mask, ret;
 
-	if (!id->qp)
+	if (!id->qp && !id->xrc_domain)
 		return ERR(EINVAL);
 
 	/* Need to update QP attributes from default values. */
@@ -962,7 +979,7 @@ static int ucma_modify_qp_rtr(struct rdma_cm_id *id, uint8_t resp_res)
 	if (ret)
 		return ret;
 
-	ret = ibv_modify_qp(id->qp, qp_attr, qp_attr_mask);
+	ret = rdma_modify_qp(id, qp_attr, qp_attr_mask);
 	if (ret)
 		return ERR(ret);
 
@@ -973,7 +990,7 @@ static int ucma_modify_qp_rtr(struct rdma_cm_id *id, uint8_t resp_res)
 
 	if (resp_res != RDMA_MAX_RESP_RES)
 		qp_attr.max_dest_rd_atomic = resp_res;
-	return rdma_seterrno(ibv_modify_qp(id->qp, qp_attr, qp_attr_mask));
+	return rdma_seterrno(rdma_modify_qp(id, qp_attr, qp_attr_mask));
 }
 
 static int ucma_modify_qp_rts(struct rdma_cm_id *id, uint8_t init_depth)
@@ -988,29 +1005,29 @@ static int ucma_modify_qp_rts(struct rdma_cm_id *id, uint8_t init_depth)
 
 	if (init_depth != RDMA_MAX_INIT_DEPTH)
 		qp_attr.max_rd_atomic = init_depth;
-	return rdma_seterrno(ibv_modify_qp(id->qp, qp_attr, qp_attr_mask));
+	return rdma_seterrno(rdma_modify_qp(id, qp_attr, qp_attr_mask));
 }
 
 static int ucma_modify_qp_sqd(struct rdma_cm_id *id)
 {
 	struct ibv_qp_attr qp_attr;
 
-	if (!id->qp)
+	if (!id->qp && !id->xrc_domain)
 		return 0;
 
 	qp_attr.qp_state = IBV_QPS_SQD;
-	return rdma_seterrno(ibv_modify_qp(id->qp, qp_attr, IBV_QP_STATE));
+	return rdma_seterrno(rdma_modify_qp(id, qp_attr, IBV_QP_STATE));
 }
 
 static int ucma_modify_qp_err(struct rdma_cm_id *id)
 {
 	struct ibv_qp_attr qp_attr;
 
-	if (!id->qp)
+	if (!id->qp && !id->xrc_domain)
 		return 0;
 
 	qp_attr.qp_state = IBV_QPS_ERR;
-	return rdma_seterrno(ibv_modify_qp(id->qp, qp_attr, IBV_QP_STATE));
+	return rdma_seterrno(rdma_modify_qp(id, qp_attr, IBV_QP_STATE));
 }
 
 static int ucma_find_pkey(struct cma_device *cma_dev, uint8_t port_num,
@@ -1029,7 +1046,7 @@ static int ucma_find_pkey(struct cma_device *cma_dev, uint8_t 

RE: [PATCH] rdma cm + XRC

2010-08-10 Thread Hefty, Sean
  - XRC support upstream (kernel and user space) is still pending.
(I can start a librdmacm branch for XRC support.)
  - Changes are needed to the kernel rdma_cm.
We could start submitting patches against Roland's xrc branch for
 these.
  - Please update to the latest librdmacm tree.
More specifically, rdma_getaddrinfo should support XRC as well.
 
 The general parameters would be the same as for RC. Should we create a new
 ai_flag ? or a new port space ?

There's an ai_qp_type field available.  I think the RDMA TCP port space would
work.

 Is it really necessary to support rdma_getaddrinfo, rdma_create_ep and the
 new APIs ?

I think so, yes.  At least XRC needs to be handled, even if some of the calls 
just fail as unsupported.

  In general, I'd like to find a way to add XRC support to the librdmacm
 that makes things as simple for the user as possible.
 
 Besides the need to correctly set cap.max_send_wr, the user API is
 unchanged.

I'm also concerned that the kernel must format the IB CM messages correctly 
when XRC is in use.  The kernel currently formats the messages assuming that 
the QP is RC.

- Sean


Re: [PATCH] rdma cm + XRC

2010-08-10 Thread Jason Gunthorpe
On Tue, Aug 10, 2010 at 09:59:50AM -0700, Hefty, Sean wrote:

  The general parameters would be the same as for RC. Should we create a new
  ai_flag ? or a new port space ?
 
 There's a ai_qp_type field available.  I think the RDMA TCP port
 space would work.

Not sure the port space matters at all?

Is there any additional CM information for XRC other than
requesting an XRC QP type? (XRCSRQ or something?)

  Is it really necessary to support rdma_getaddrinfo, rdma_create_ep and the
  new APIs ?
 
 I think so, yes.  At least XRC needs to be handled, even if some of
 the calls just fail as unsupported.

I'd like to see a strong rationale for leaving any of the new API
unsupported for XRC - IMHO it should all be doable. The new API is
supposed to be simplifying, we want people to use it..

Jason


RE: [PATCH] rdma cm + XRC

2010-08-10 Thread Hefty, Sean
  There's a ai_qp_type field available.  I think the RDMA TCP port
  space would work.
 
 Not sure the port space matters at all?
 
 Is there anything additional CM information for XRC other than
 requesting an XRC QP type? (XRCSRQ or something?)

It's nothing huge:

Modifications to Table 99:
* Responder Resources field, Values column: 0 for XRC
* Transport Service Type field, Description column: See
  Section 14.6.3.1 Transport Service Type.
* Retry Count field, Values column: 0 for XRC
* RNR Retry Count field, Values column: 0 for XRC
* SRQ field, Values column: 0 for XRC
* Change (reserved) field at position byte 51, bit 5 to:
  * Field: Extended Transport Type
  * Description: See Section 14.6.3.2 Extended Transport Type
  * Used for Purpose: C
  * Byte [Bit] Offset: 51 [5]
  * Length, Bits: 3

Modifications to Table 103:
* Initiator Depth field, Values column: 0 for XRC
* End-to-End Flow Control field, Values column: 0 for XRC
* SRQ field, Values column: 1 for XRC

But it should still be set correctly.

- Sean


Re: [PATCH] rdma cm + XRC

2010-08-10 Thread frank zago
On 08/10/2010 12:14 PM, Jason Gunthorpe wrote:
 On Tue, Aug 10, 2010 at 09:59:50AM -0700, Hefty, Sean wrote:
 
 The general parameters would be the same as for RC. Should we create a new
 ai_flag ? or a new port space ?

 There's a ai_qp_type field available.  I think the RDMA TCP port
 space would work.
 
 Not sure the port space matters at all?
 
 Is there anything additional CM information for XRC other than
 requesting an XRC QP type? (XRCSRQ or something?)

Creating a send or receive XRC QP uses a different API (ibv_create_qp vs
ibv_create_xrc_rcv_qp), so I used the max_send_wr capability attribute to
discriminate between the two cases. That's the only visible change to the API.
On 7/30, I posted a patch to perftest rdma_bw to show how it's used.

 
 Is it really necessary to support rdma_getaddrinfo, rdma_create_ep and the
 new APIs ?

 I think so, yes.  At least XRC needs to be handled, even if some of
 the calls just fail as unsupported.
 
 I'd like to see a strong rationale for leaving any of the new API
 unsupported for XRC - IMHO it should all be doable. The new API is
 supposed to be simplifying, we want people to use it..

No rationale, besides that someone has to write some code :)
 
Regards,
  Frank.


Re: [PATCH] rdma cm + XRC

2010-08-10 Thread frank zago
On 08/10/2010 12:14 PM, Jason Gunthorpe wrote:
 On Tue, Aug 10, 2010 at 09:59:50AM -0700, Hefty, Sean wrote:
 
 The general parameters would be the same as for RC. Should we create a new
 ai_flag ? or a new port space ?

 There's a ai_qp_type field available.  I think the RDMA TCP port
 space would work.
 
 Not sure the port space matters at all?
 
 Is there anything additional CM information for XRC other than
 requesting an XRC QP type? (XRCSRQ or something?)
 
 Is it really necessary to support rdma_getaddrinfo, rdma_create_ep and the
 new APIs ?

 I think so, yes.  At least XRC needs to be handled, even if some of
 the calls just fail as unsupported.
 
 I'd like to see a strong rationale for leaving any of the new API
 unsupported for XRC - IMHO it should all be doable. The new API is
 supposed to be simplifying, we want people to use it..

It seems the new API has too many constraints for XRC. There are a couple 
things that don't fit:

- XRC needs a domain, which must be created before creating the QP, but
  after we know the device to use. In addition it also needs a file
  descriptor. The application may want to use a different fd depending on
  the device. Currently the domain can only be created in the middle of
  rdma_create_ep().

- The server side of the connection also needs an SRQ. It's not obvious
  whether it's the application or rdma cm to create that SRQ. And that SRQ
  number must be given to the client side, presumably in the private data.

Frank.



Re: [PATCH] rdma cm + XRC

2010-08-10 Thread Jason Gunthorpe
On Tue, Aug 10, 2010 at 04:05:42PM -0500, frank zago wrote:

 It seems the new API has too many constraints for XRC. There are a
 couple things that don't fit:

I'll try to take a more careful look at this later, but just want to
say that the new APIs are so new that we could still change them - not
supporting XRC seems like an API design failing that might haunt us
later?

Keep in mind that rdma_getaddrinfo is the scheme to be used for CM
address resolution scalability, so it seems natural that anyone using
RDMA CM and XRC would want to use both together.

 - XRC needs a domain, which must be created before creating the QP,
 but after we know the device to use. In addition it also needs a
 file descriptor. The application may want to use a different fd
 depending on the device. Currently the domain can only be created in
 the middle of rdma_create_ep().

Well.. the XRC domain needs to be an input to create_ep just like
the PD :(

In looking at how this API turned out maybe the PD should have been
carried in the rdma_addrinfo? Certainly I would put the XRC domain
in there.. Recall my original comments about the PD being used to
restrict device selection in rdma_getaddrinfo.

Not sure what the FD is about (it's been a while since the libibverbs XRC
patches were posted)?

 - The server side of the connection also needs an SRQ. It's not
 obvious whether it's the application or rdma cm to create that
 SRQ. And that SRQ number must be given to the client side,
 presumably in the private data.

All rdma_create_ep could do is setup the XRC INI QP and XRC TGT QP,
the XRC SRQ is associated with the XRC domain so it has to be managed
by the app.

The private data is passed into rdma_connect, after create_ep, so
there seems to be no problem there, the app can format up the SRQ
number according to the CM private data protocol it is using..
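One hypothetical private-data layout for that last point - nothing here is a defined CM protocol, purely an illustration of "the app formats up the SRQ number" - would be to carry the server's SRQ number in the first four bytes, in network byte order:

```c
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

/* Hypothetical app-defined convention: the first 4 bytes of the CM
 * private data carry the server's XRC SRQ number in network byte order.
 * The remaining private-data bytes stay available to the application. */
static void pack_srq_num(uint8_t *priv, uint32_t srq_num)
{
    uint32_t be = htonl(srq_num);
    memcpy(priv, &be, sizeof(be));
}

static uint32_t unpack_srq_num(const uint8_t *priv)
{
    uint32_t be;
    memcpy(&be, priv, sizeof(be));
    return ntohl(be);
}
```

The server would fill its buffer with pack_srq_num() before passing it as conn_param private data; the client reads it back in the connect-established event handler.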

Jason


RE: [PATCH] rdma cm + XRC

2010-08-10 Thread Hefty, Sean
 Well.. the XRC domain needs to be an input to create_ep just like
 the PD :(
 
 In looking at how this API turned out maybe the PD should have been
 carried in the rdma_addrinfo? Certainly I would put the XRC domain
 in there.. Recall my original comments about the PD being used to
 restrict device selection in rdma_getaddrinfo.

Personally, if I had it to do over, I don't know that I would have exposed the 
PD through the librdmacm at all.  Does anyone ever allocate more than one per 
device?

I feel similarly about the XRC domain.  Is there any real reason to expose it?  
What if we just defined a 1:1 relationship between PDs and XRC domains, or 
between XRC domains and XRC TGT QPs?

- Sean


Re: [PATCH] rdma cm + XRC

2010-08-10 Thread Jason Gunthorpe
On Tue, Aug 10, 2010 at 04:18:57PM -0700, Hefty, Sean wrote:
  Well.. the XRC domain needs to be an input to create_ep just like
  the PD :(
  
  In looking at how this API turned out maybe the PD should have been
  carried in the rdma_addrinfo? Certainly I would put the XRC domain
  in there.. Recall my original comments about the PD being used to
  restrict device selection in rdma_getaddrinfo.
 
 Personally, if I had it to do over, I don't know that I would have
 exposed the PD through the librdmacm at all.  Does anyone ever
 allocate more than one per device?

That isn't a half bad idea I guess, and if it was in rdma_addinfo it
could be 0 = use global PD/XRC and 99% of apps can just do that..
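The "0 = use the global one" convention could be as simple as a NULL fallback inside the library; the struct and function names below are stand-ins (struct pd for struct ibv_pd), not a proposed API:

```c
#include <stddef.h>

/* Illustrative stand-in for struct ibv_pd. */
struct pd { int id; };

/* Sketch of the convention: a NULL PD in rdma_addrinfo means
 * "fall back to the library's per-device global PD". */
static struct pd *select_pd(struct pd *ai_pd, struct pd *global_pd)
{
    return ai_pd ? ai_pd : global_pd;
}
```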

Maybe you should just go ahead and do that? It would not be hard to
do so and even keep ABI (but not API) compatibility. Now would be the
time :)

 I feel similarly about the XRC domain.  Is there any real reason to
 expose it?  What if we just defined a 1:1 relationship between PDs
 and XRC domains, or between XRC domains and XRC TGT QPs?

As near as I can tell, it serves the same purpose as the PD: to provide
a form of security within a single process..

I could imagine things like storage apps wanting to use them -
MRs, etc., need to be carefully compartmentalized if you do not entirely
trust your peer.

Jason


RE: [PATCH] rdma cm + XRC

2010-08-09 Thread Hefty, Sean
 This allows rdma ucm to establish an XRC connection between two nodes. Most
 of the changes are related to modify_qp since the API is different
 whether the QP is on the send or receive side.
 To create an XRC receive QP, the cap.max_send_wr must be set to 0.
 Conversely, to create the send XRC QP, that attribute must be non-zero.

I need to give XRC support to the librdmacm more thought, but here are at least 
the initial concerns:

- XRC support upstream (kernel and user space) is still pending.
  (I can start a librdmacm branch for XRC support.)
- Changes are needed to the kernel rdma_cm.
  We could start submitting patches against Roland's xrc branch for these.
- Please update to the latest librdmacm tree.
  More specifically, rdma_getaddrinfo should support XRC as well.

In general, I'd like to find a way to add XRC support to the librdmacm that 
makes things as simple for the user as possible.

- Sean


Re: [PATCH] rdma cm + XRC

2010-08-03 Thread Richard Frank

Hello Frank, Thank you for these patches.. !!

We are working on a user mode IPC based on XRC.. we will try these patches..

frank zago wrote:

Hello,

This allows rdma ucm to establish an XRC connection between two nodes. Most
of the changes are related to modify_qp since the API is different
whether the QP is on the send or receive side.
To create an XRC receive QP, the cap.max_send_wr must be set to 0.
Conversely, to create the send XRC QP, that attribute must be non-zero.

Regards,
  Frank