RE: [RFC] XRC upstream merge reboot

2011-08-22 Thread Hefty, Sean
I am a bit concerned here. In the current usage model, target QPs are destroyed when their reference count goes to zero (ib_reg_xrc_recv_qp and ibv_xrc_create_qp increment the reference count, while ib_unreg_xrc_recv_qp decrements it). In this model, the TGT QP user/consumer does not
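
For context, a rough sketch of the reference-counted lifecycle described here, using the OFED-era userspace names (spellings and signatures follow the OFED XRC patches and may differ from the kernel-side names quoted above; xrc_domain is an already-opened domain shared by both processes):

    /* Process A creates the receive-side (TGT) QP in the kernel; creation
     * takes the first reference on it. */
    struct ibv_qp_init_attr init_attr;
    uint32_t tgt_qpn;

    memset(&init_attr, 0, sizeof(init_attr));
    init_attr.xrc_domain = xrc_domain;        /* shared XRC domain */
    init_attr.qp_type    = IBV_QPT_XRC;
    ibv_create_xrc_rcv_qp(&init_attr, &tgt_qpn);

    /* Process B learns tgt_qpn out of band and takes its own reference,
     * so the TGT QP survives process A exiting. */
    ibv_reg_xrc_rcv_qp(xrc_domain, tgt_qpn);

    /* Each process drops its reference when done; the kernel destroys the
     * TGT QP once the count reaches zero. */
    ibv_unreg_xrc_rcv_qp(xrc_domain, tgt_qpn);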

Re: [RFC] XRC upstream merge reboot

2011-08-21 Thread Jack Morgenstein
On Thursday 11 August 2011 01:20, Hefty, Sean wrote: To help with OFED feature level compatibility, I'm in the process of adding a new call to ibverbs: struct ib_qp_open_attr { void (*event_handler)(struct ib_event *, void *); void  *qp_context; u32    qp_num; };
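
For reference, a sketch of how a consumer might attach to an existing TGT QP through such an attr struct, assuming an open-style entry point along the lines of the ib_open_qp() call that later went into the kernel verbs layer (xrcd, tgt_qpn, my_ctx and the handler are assumed to exist; the upstream struct also grew a qp_type field):

    /* Open (attach to) an existing XRC TGT QP on the xrcd by QP number,
     * registering an event handler and context for its async events. */
    struct ib_qp_open_attr attr = {
            .event_handler = my_tgt_qp_event_handler,  /* void fn(struct ib_event *, void *) */
            .qp_context    = my_ctx,
            .qp_num        = tgt_qpn,
            .qp_type       = IB_QPT_XRC_TGT,           /* field added upstream */
    };
    struct ib_qp *qp = ib_open_qp(xrcd, &attr);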

Re: [RFC] XRC upstream merge reboot

2011-08-11 Thread Shamis, Pavel
I think it's a good idea to support both usage models. Regards, Pasha. Things only get complicated when the domain-allocator process allocates a single domain and simply uses that single domain for all jobs (i.e., the domain is never de-allocated for the lifetime of the allocating process,

RE: [RFC] XRC upstream merge reboot

2011-08-10 Thread Hefty, Sean
Things only get complicated when the domain-allocator process allocates a single domain and simply uses that single domain for all jobs (i.e., the domain is never de-allocated for the lifetime of the allocating process, and the allocating process is the server for all jobs). To help with

Re: [RFC] XRC upstream merge reboot

2011-08-03 Thread Jack Morgenstein
On Tuesday 02 August 2011 19:29, Shamis, Pavel wrote: The XRC domain is created by the process that starts first. All the rest of the processes that belong to the same MPI session and reside on the same node join the domain. The TGT QP is created by the process that receives an inbound connection first, and it is

RE: [RFC] XRC upstream merge reboot

2011-08-03 Thread Shamis, Pavel
BTW, did we have the same limitation/feature (only the creating process is allowed to modify) in the original XRC driver? I'm not certain about the implementation, but the OFED APIs would allow any process within the xrc domain to modify the qp. Hmm, is there a way to destroy the QP, when the

RE: [RFC] XRC upstream merge reboot

2011-08-03 Thread Hefty, Sean
Well, actually I was thinking about APM. If the creator exits, we do not have a way to upload an alternative path. Correct - that would be a limitation. You would need to move to a new tgt qp. In a general solution, this involves not only allowing other processes to modify the QP, but also

RE: [RFC] XRC upstream merge reboot

2011-08-03 Thread Shamis, Pavel
Well, actually I was thinking about APM. If the creator exits, we do not have a way to upload an alternative path. Correct - that would be a limitation. You would need to move to a new tgt qp. Well, in Open MPI we have XRC code that uses APM. If Mellanox cares about the feature, they

RE: [RFC] XRC upstream merge reboot

2011-08-03 Thread Hefty, Sean
Well, in Open MPI we have XRC code that uses APM. If Mellanox cares about the feature, they would have to rework this part of code in Open MPI. I don't know about other apps. But does the APM implementation expect some process other than the creator to be able to modify the QP? APM

RE: [RFC] XRC upstream merge reboot

2011-08-03 Thread Shamis, Pavel
Well, in Open MPI we have XRC code that uses APM. If Mellanox cares about the feature, they would have to rework this part of code in Open MPI. I don't know about other apps. But does the APM implementation expect some process other than the creator to be able to modify the QP?

Re: [RFC] XRC upstream merge reboot

2011-08-03 Thread Jason Gunthorpe
On Wed, Aug 03, 2011 at 05:16:17PM -0400, Shamis, Pavel wrote: Well, in Open MPI we have XRC code that uses APM. If Mellanox cares about the feature, they would have to rework this part of code in Open MPI. I don't know about other apps. But does the APM implementation

RE: [RFC] XRC upstream merge reboot

2011-08-03 Thread Hefty, Sean
Where does the ib_verbs async event for APM state change get routed for XRC? The OFED APIs route QP events to all processes which register for that qp number. Does the event have enough info to identify all the necessary parts? The event carries the qp number only. Can the process that

Re: [RFC] XRC upstream merge reboot

2011-08-03 Thread Jason Gunthorpe
On Thu, Aug 04, 2011 at 12:06:24AM +, Hefty, Sean wrote: Where does the ib_verbs async event for APM state change get routed for XRC? The OFED APIs route QP events to all processes which register for that qp number. ?? How do you register for an event? There is only

RE: [RFC] XRC upstream merge reboot

2011-08-03 Thread Hefty, Sean
?? How do you register for an event? There is only ibv_get_async_event(3) - I thought it returned all events relevant to the associated verbs context. The OFED APIs for managing XRC receive QPs are: int (*create_xrc_rcv_qp)(struct ibv_qp_init_attr *init_attr, uint32_t
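
For context, the generic libibverbs async event loop looks like the sketch below; for an ordinary QP, event.element.qp identifies the QP, and the open question in this sub-thread is how a kernel-created XRC TGT QP is identified and which processes see its events (handle_apm_event is a hypothetical helper):

    struct ibv_async_event event;

    while (ibv_get_async_event(verbs_ctx, &event) == 0) {
            switch (event.event_type) {
            case IBV_EVENT_PATH_MIG:        /* APM: migration completed */
            case IBV_EVENT_PATH_MIG_ERR:    /* APM: migration failed */
                    handle_apm_event(&event);
                    break;
            default:
                    break;
            }
            ibv_ack_async_event(&event);
    }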

Re: [RFC] XRC upstream merge reboot

2011-08-02 Thread Jack Morgenstein
On Monday 01 August 2011 21:28, Hefty, Sean wrote: From Pavel Shamis: We do have unregister on finalization. But this code doesn't introduce any synchronization across processes on the same node, since the kernel manages the receive qp. If the reference counter is moved to the app

Re: [RFC] XRC upstream merge reboot

2011-08-02 Thread Shamis, Pavel
Hi Jack, please see my comments below. From Pavel Shamis: We do have unregister on finalization. But this code doesn't introduce any synchronization across processes on the same node, since the kernel manages the receive qp. If the reference counter is moved to the app's responsibility, it will

Re: [RFC] XRC upstream merge reboot

2011-08-02 Thread Shamis, Pavel
We do have unregister on finalization. But this code doesn't introduce any synchronization across processes on the same node, since the kernel manages the receive qp. If the reference counter is moved to the app's responsibility, it will force the app to manage the reference counter at the app level

RE: [RFC] XRC upstream merge reboot

2011-08-02 Thread Hefty, Sean
If the target QP is opened in the low-level driver, then it's owned by the group of processes that share the same XRC domain. Can you define what you mean by 'owned'? With the latest patches, the target qp is created in the kernel. Data received on the target qp can go to any process sharing the

Re: [RFC] XRC upstream merge reboot

2011-08-02 Thread Shamis, Pavel
On Aug 2, 2011, at 5:25 PM, Hefty, Sean wrote: If the target QP is opened in the low-level driver, then it's owned by the group of processes that share the same XRC domain. Can you define what you mean by 'owned'? With the latest patches, the target qp is created in the kernel. Data received

RE: [RFC] XRC upstream merge reboot

2011-08-02 Thread Hefty, Sean
BTW, did we have the same limitation/feature (only the creating process is allowed to modify) in the original XRC driver? I'm not certain about the implementation, but the OFED APIs would allow any process within the xrc domain to modify the qp. Hmm, is there a way to destroy the QP, when the original

RE: [RFC] XRC upstream merge reboot

2011-08-01 Thread Hefty, Sean
Actually, I think it is really not such a good idea to manage the reference counter across OOB communication. But this is exactly what the current API *requires* that users of XRC do!!! And I agree, it's not a good idea. :) IMHO, I don't see a good reason to redefine the existing API. I'm afraid that such

Re: [RFC] XRC upstream merge reboot

2011-08-01 Thread Shamis, Pavel
Actually, I think it is really not such a good idea to manage the reference counter across OOB communication. But this is exactly what the current API *requires* that users of XRC do!!! And I agree, it's not a good idea. :) We do have unregister on finalization. But this code doesn't introduce any

RE: [RFC] XRC upstream merge reboot

2011-08-01 Thread Hefty, Sean
We do have unregister on finalization. But this code doesn't introduce any synchronization across processes on the same node, since the kernel manages the receive qp. If the reference counter is moved to the app's responsibility, it will force the app to manage the reference counter at the app level,

Re: [RFC] XRC upstream merge reboot

2011-07-26 Thread Shamis, Pavel
Please see my notes below. I've tried to come up with a clean way to determine the lifetime of an xrc tgt qp, and I think the best approach is still: 1. Allow the creating process to destroy it at any time, and 2a. If not explicitly destroyed, the tgt qp is bound to the lifetime of the

Re: [RFC] XRC upstream merge reboot

2011-07-21 Thread Jack Morgenstein
On Wednesday 20 July 2011 21:51, Hefty, Sean wrote: I've tried to come up with a clean way to determine the lifetime of an xrc tgt qp, and I think the best approach is still: 1. Allow the creating process to destroy it at any time, and 2a. If not explicitly destroyed, the tgt qp is

Re: [RFC] XRC upstream merge reboot

2011-07-21 Thread Jack Morgenstein
On Thursday 21 July 2011 10:38, Jack Morgenstein wrote: Having a new OFED support BOTH interfaces is a nightmare I don't even want to think about! I over-reacted here, sorry about that. I know that it will be difficult to support both the old and the new interface. However, to support the

Re: [RFC] XRC upstream merge reboot

2011-07-21 Thread Jeff Squyres
On Jul 21, 2011, at 3:38 AM, Jack Morgenstein wrote: If MPI can use a different XRC domain per job (and deallocate the domain at the job's end), this would solve the tgt qp lifetime problem (-- by destroying all the tgt qp's when the xrc domain is deallocated). What happens if the MPI job

Re: [RFC] XRC upstream merge reboot

2011-07-21 Thread Jeff Squyres
On Jul 21, 2011, at 8:47 AM, Jack Morgenstein wrote: [snip] When the last user of an XRC domain exits cleanly (or crashes), the domain should be destroyed. In this case, with Sean's design, the tgt qp's for the XRC domain should also be destroyed. Sounds perfect. -- Jeff Squyres

RE: [RFC] XRC upstream merge reboot

2011-07-21 Thread Hefty, Sean
If you use file descriptors for the XRC domain, then when the last user of the domain exits, the domain gets destroyed (at least this is how it works in OFED; Sean's code looks the same). In this case, the kernel cleanup code for the process should close the XRC domains opened by that process, so
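
A minimal sketch of the fd-backed domain sharing being described, shown with the ibv_open_xrcd() form that later landed upstream (the path is purely illustrative; OFED's ibv_open_xrc_domain() takes the fd and flags directly):

    /* Every process sharing the domain opens the same file; the kernel ties
     * the domain's lifetime to the open file, so when the last opener exits,
     * cleanly or not, the domain (and any TGT QPs bound to it) goes away. */
    int fd = open("/tmp/job1234.xrcd", O_RDONLY | O_CREAT, 0600);

    struct ibv_xrcd_init_attr xrcd_attr = {
            .comp_mask = IBV_XRCD_INIT_ATTR_FD | IBV_XRCD_INIT_ATTR_OFLAGS,
            .fd        = fd,
            .oflags    = O_CREAT,           /* create the domain if needed */
    };
    struct ibv_xrcd *xrcd = ibv_open_xrcd(verbs_ctx, &xrcd_attr);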

RE: [RFC] XRC upstream merge reboot

2011-07-21 Thread Hefty, Sean
I've tried to come up with a clean way to determine the lifetime of an xrc tgt qp, and I think the best approach is still: 1. Allow the creating process to destroy it at any time, and 2a. If not explicitly destroyed, the tgt qp is bound to the lifetime of the xrc domain or 2b.

RE: [RFC] XRC upstream merge reboot

2011-07-20 Thread Hefty, Sean
I've tried to come up with a clean way to determine the lifetime of an xrc tgt qp, and I think the best approach is still: 1. Allow the creating process to destroy it at any time, and 2a. If not explicitly destroyed, the tgt qp is bound to the lifetime of the xrc domain or 2b. The creating

Re: [RFC] XRC upstream merge reboot

2011-06-23 Thread Jack Morgenstein
On Wednesday 22 June 2011 22:57, Hefty, Sean wrote: We can report the creation of a tgt qp on an xrcd as an async event. To whom? To all users of the xrcd. IMO, if we require undefined, out-of-band communication to use XRC, then we have an incomplete solution. It's just too bad that

Re: [RFC] XRC upstream merge reboot

2011-06-22 Thread Jack Morgenstein
Hi Sean, Some initial feature feedback. I noticed (from the code in your git, xrc branch) that the XRC target QPs stick around until the XRC domain is de-allocated. There was a long thread about this in December, 2007, where the MPI community found this approach unacceptable (leading to

RE: [RFC] XRC upstream merge reboot

2011-06-22 Thread Hefty, Sean
I noticed (from the code in your git, xrc branch) that the XRC target QPs stick around until the XRC domain is de-allocated. There was a long thread about this in December, 2007, where the MPI community found this approach unacceptable (leading to accumulation of dead XRC TGT qp's). They

Re: [RFC] XRC upstream merge reboot

2011-06-22 Thread Jack Morgenstein
On Wednesday 22 June 2011 19:14, Hefty, Sean wrote: This is partly true, and I haven't come up with a better way to handle this. Note that the patches allow the original creator of the TGT QP to destroy it by simply calling ibv_destroy_qp(). This doesn't handle the process dying, but maybe

RE: [RFC] XRC upstream merge reboot

2011-06-22 Thread Hefty, Sean
After looking at the implementation more, what I didn't like about the reg/unreg calls is that it is independent of receiving data on an SRQ. That is, a user can receive data on an SRQ through a TGT QP before they have registered and after unregistering. That is correct, but the

Re: [RFC] XRC upstream merge reboot

2011-06-22 Thread Jack Morgenstein
I read over the threads that you referenced. I do understand what the reg/unreg calls were trying to do. In short, I agree with your original approach of letting the tgt qp hang around while the xrcd exists, and I'm not convinced what HP MPI was trying to do should drive a more

RE: [RFC] XRC upstream merge reboot

2011-06-22 Thread Tziporet Koren
For MPI, I would expect an xrcd to be associated with a single job instance. So did I, but they said that this was not the case, and they were very pleased with the final (more complicated implementation-wise) interface. We need to get them involved in this discussion ASAP.

RE: [RFC] XRC upstream merge reboot

2011-06-22 Thread Hefty, Sean
For MPI, I would expect an xrcd to be associated with a single job instance. So did I, but they said that this was not the case, and they were very pleased with the final (more complicated implementation-wise) interface. We need to get them involved in this discussion ASAP. I agree. But

Re: [RFC] XRC upstream merge reboot

2011-05-18 Thread Jack Morgenstein
Sean, Great that you are taking this on! I will review this next week. -Jack On Tuesday 17 May 2011 00:13, Hefty, Sean wrote: I've been working on a set of XRC patches aimed at upstream inclusion to the kernel, libibverbs, and librdmacm. I'm using existing patches as the major starting

RE: [RFC] XRC upstream merge reboot

2011-05-18 Thread Hefty, Sean
Great that you are taking this on! I will review this next week. Hopefully I'll have some early patches sometime next week. See below for my current thoughts based on how the implementation is progressing. My thoughts change hourly. From an architecture viewpoint, XRC adds 4 new XRC
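
For orientation, the four objects in question are the XRC domain, the XRC shared receive queue, the XRC target ("receive") QP and the XRC initiator ("send") QP. A sketch using the extended-verbs names that eventually landed in libibverbs (the exact API was still in flux in this thread; ctx, pd and the CQs are assumed to exist):

    struct ibv_xrcd *xrcd = ibv_open_xrcd(ctx, &xrcd_attr);

    /* XRC SRQ: the receive side that all processes in the domain share. */
    struct ibv_srq_init_attr_ex srq_attr = {
            .attr      = { .max_wr = 256, .max_sge = 1 },
            .comp_mask = IBV_SRQ_INIT_ATTR_TYPE | IBV_SRQ_INIT_ATTR_XRCD |
                         IBV_SRQ_INIT_ATTR_CQ | IBV_SRQ_INIT_ATTR_PD,
            .srq_type  = IBV_SRQT_XRC,
            .pd        = pd,
            .xrcd      = xrcd,
            .cq        = recv_cq,
    };
    struct ibv_srq *srq = ibv_create_srq_ex(ctx, &srq_attr);

    /* XRC TGT (receive) QP: lives on the xrcd, no CQs of its own. */
    struct ibv_qp_init_attr_ex tgt_attr = {
            .qp_type   = IBV_QPT_XRC_RECV,
            .comp_mask = IBV_QP_INIT_ATTR_XRCD,
            .xrcd      = xrcd,
    };
    struct ibv_qp *tgt_qp = ibv_create_qp_ex(ctx, &tgt_attr);

    /* XRC INI (send) QP: a per-process, send-only QP. */
    struct ibv_qp_init_attr_ex ini_attr = {
            .send_cq   = send_cq,
            .recv_cq   = send_cq,
            .cap       = { .max_send_wr = 64, .max_send_sge = 1 },
            .qp_type   = IBV_QPT_XRC_SEND,
            .comp_mask = IBV_QP_INIT_ATTR_PD,
            .pd        = pd,
    };
    struct ibv_qp *ini_qp = ibv_create_qp_ex(ctx, &ini_attr);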

Re: [RFC] XRC upstream merge reboot

2011-05-18 Thread Roland Dreier
On Mon, May 16, 2011 at 2:13 PM, Hefty, Sean sean.he...@intel.com wrote: libibverbs -- We define a new device capability flag IBV_DEVICE_EXT_OPS, indicating that the library supports extended operations.  If set, the provider library returns an extended structure from
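
A hypothetical sketch of the capability-flag check being quoted (IBV_DEVICE_EXT_OPS is the proposal's name, not a shipped flag):

    struct ibv_device_attr dev_attr;
    int have_ext_ops = 0;

    if (!ibv_query_device(ctx, &dev_attr) &&
        (dev_attr.device_cap_flags & IBV_DEVICE_EXT_OPS))
            have_ext_ops = 1;   /* the provider's context carries the
                                 * extended (e.g. XRC) entry points */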

Re: [RFC] XRC upstream merge reboot

2011-05-18 Thread Jason Gunthorpe
On Wed, May 18, 2011 at 09:44:01AM -0700, Roland Dreier wrote: and have support for named extensions, I think that would be even better. ie we could define a bunch of new XRC related stuff and then have some interface to the driver where we ask for the XRC extension (by name with a string)

RE: [RFC] XRC upstream merge reboot

2011-05-18 Thread Hefty, Sean
As long as the version number in the ibv_context is increasing and not branching then I think it is OK. 0 = what we have now. 1 = + XRC, 2 = +XRC+ummunotify, etc. Drivers 0 out the function pointers they do not support. I was thinking more along this line, but I can see how using a named
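
A hypothetical illustration of the versioning idea being discussed (none of these names shipped; it only shows "drivers zero out the function pointers they do not support", with the app checking before calling):

    struct ibv_context_ext {                  /* hypothetical layout */
            struct ibv_context base;
            int                ext_version;   /* 0 = legacy, 1 = +XRC, ... */
            struct ibv_xrc_domain *(*open_xrc_domain)(struct ibv_context *ctx,
                                                       int fd, int oflag);
    };

    struct ibv_context_ext *ext = (struct ibv_context_ext *) ctx;

    if (ext->ext_version >= 1 && ext->open_xrc_domain)
            xrc_domain = ext->open_xrc_domain(ctx, fd, O_CREAT);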

Re: [RFC] XRC upstream merge reboot

2011-05-18 Thread Jason Gunthorpe
On Wed, May 18, 2011 at 05:30:30PM +, Hefty, Sean wrote: As long as the version number in the ibv_context is increasing and not branching then I think it is OK. 0 = what we have now. 1 = + XRC, 2 = +XRC+ummunotify, etc. Drivers 0 out the function pointers they do not support. I was

RE: [RFC] XRC upstream merge reboot

2011-05-18 Thread Hefty, Sean
The size is 3*64 + 1*32, so there is a 32-bit pad; thus we can rewrite it as: union { struct { uint64_t remote_addr; uint32_t rkey; uint32_t xrc_remote_qpn; }

Re: [RFC] XRC upstream merge reboot

2011-05-18 Thread Jason Gunthorpe
On Wed, May 18, 2011 at 06:13:54PM +, Hefty, Sean wrote: You need it in the normal send case as well, either outside of the union, or part of a new struct within the union. Works for me.. union { [..] struct { uint64_t reserved1[3]; uint32_t reserved2; uint32_t
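
For reference, the shape this eventually took in the libibverbs API that went upstream leaves the existing wr union untouched and carries the XRC target SRQ number in a separate qp_type union, so a minimal XRC send from an INI QP looks roughly like this (ini_qp, buf, len, mr and remote_srq_num are assumed to exist):

    struct ibv_sge sge = {
            .addr   = (uintptr_t) buf,
            .length = len,
            .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {
            .wr_id      = 1,
            .sg_list    = &sge,
            .num_sge    = 1,
            .opcode     = IBV_WR_SEND,
            .send_flags = IBV_SEND_SIGNALED,
    };
    struct ibv_send_wr *bad_wr;

    wr.qp_type.xrc.remote_srqn = remote_srq_num;   /* SRQ on the target node */
    ibv_post_send(ini_qp, &wr, &bad_wr);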

Re: [RFC] XRC upstream merge reboot

2011-05-18 Thread Roland Dreier
On Wed, May 18, 2011 at 10:30 AM, Hefty, Sean sean.he...@intel.com wrote: As long as the version number in the ibv_context is increasing and not branching then I think it is OK. 0 = what we have now. 1 = + XRC, 2 = +XRC+ummunotify, etc. Drivers 0 out the function pointers they do not support.

RE: [RFC] XRC upstream merge reboot

2011-05-18 Thread Hefty, Sean
What about something along the lines of the following? This is 2 incomplete patches squashed together, lacking any serious documentation. I believe this will support existing apps and driver libraries, either as binaries or by recompiling unmodified source code. A driver library calls