On Nov 14, 2013, at 12:21 PM, Barrett, Brian W <bwba...@sandia.gov> wrote:

> On 11/14/13 1:13 PM, "Joshua Ladd" <josh...@mellanox.com> wrote:
> 
>> Let me try to summarize my understanding of the situation:
>> 
>> 1. Ralph made the OOB asynchronous.
>> 
>> 2. OOB cpcs don't work as a result of 1, and are thereby "deprecated",
>> meaning: won't fix.
>> 
>> 3. Pasha moved openib/connect to common/ofacm but excluded the rdmacm
>> in that move, and openib was never changed to use common/ofacm.
>> 
>> 4. UDCM is "functional" in the trunk, still sitting in openib/connect.
>> But no one is entirely sure if it really works which is why it was
>> disabled in 1.7. Nathan - is there a design doc you can share on this
>> beyond the comments in the code?
>> 
>> 5. In order to satisfy the "grand plan":
>>      a. UDCM still needs to be moved to common/ofacm.
>>      b. OpenIB still needs to be changed to use common/ofacm.
>>      c. RDMACM still needs to migrate to common/ofacm.
>>      d. XRC support needs to be added to UDCM and put into
>>         common/ofacm.
>> 
>> 6. The "grand plan" being:  move the BTLs into Opal - hence the need to
>> scuttle the OOB cpcs thereby justifying the deprecation and not fixing
>> cpcs after #1.
>> 
>> So, that's a quick roundup of how we ended up here (as I understand it.)
>> What needs to be done is:
> 
> That's my understanding as well.
> 
>> 1. Somebody needs to certify/review that what Nathan has done is sound.
>> From my perspective, this is a BIG change and needs a comprehensive
>> architecture review. We've been using it in the trunk, and we've been
>> testing it under MTT for some time - but have not deployed or tested at
>> large-scale out in the field. Would be nice to see something on paper in
>> terms of a design doc.
>> 
>> 2. Somebody then needs to move UDCM into common/ofacm.
>> 
>> 3. Somebody needs to change openib to use common/ofacm cpcs instead of
>> openib/connect cpcs.
>> 
>> 4. Somebody needs to move RDMACM into common/ofacm and make sure RoCEE
>> works.
>> 
>> 5. Somebody needs to add XRC support to UDCM - whatever that might mean.
>> Given Nathan added UDCM back in 2011 and nobody is really sure it's ready
>> for prime-time, and given Pasha's comments regarding the difference in
>> state machine requirements between the two connection schemes, this
>> doesn't seem like a trivial task.
>> 
>> Given Nathan's comments a second ago about ORNL not supporting the IB
>> Offload component, it barely makes sense to keep common/ofacm. And it
>> sounds like the two cpcs presently contained therein are now unusable.
>> 
>> All of this work is a result of the Grand Plan to move the BTLs into the
>> Opal layer, and I have no idea what the motive for that is (I was not
>> involved with OMPI when this was decided or debated).
>> 
>> Basically, without these five changes OpenIB is dead in 1.7.4 and beyond
>> for RC, XRC, and RoCEE. These are blockers to 1.7.4 and I don't believe
>> that the onus falls squarely on Mellanox to fix these. These were
>> community decisions and, as such, it must be a community effort to
>> resolve. We are happy to lend a hand, but we are not fixing all of this
>> mess.
> 
> I think that the 5 steps above sound correct and I agree that 1) this
> means 1.7.4 is on hold until we fix this and 2) that Mellanox shouldn't be
> the only one to fix this for 1.7.4, given the amount of work involved.
> 
> Ralph, what, specifically, broke about the oob/xoob cpc mechanisms by
> making the oob asynchronous?

Hard for me to say as I don't really have access to an IB machine any more. 
Odin is my sole reference point, and someone has had that fully locked up for 
more than a week (and I can't complain as I am totally a guest there). Even 
then, I can only test on a few nodes.

I have no objection to helping, but we need someone who cares about IB and has 
access to such a machine to take the lead. Otherwise, we're just spinning our 
wheels.

As for the work issue: note that this has been "under development" now for more 
than a year. We've talked at length about how "somebody" needs to fix the 
openib/ofacm issue, but everyone keeps pushing it down the road as "not mine". 
Like I said, I can help - but (a) my boss couldn't care less about this issue, 
and (b) I have no way to test the results.



>  That is, 1-5 are a huge amount of work; have
> we done the analysis to say that updating the oob / xoob cpc to work with
> the new oob is actually more work than doing 1-5?  Obviously, there are
> long-term plans that make oob/xoob problematic.  But those aren't 1.7 / 1.8
> plans.  Unfortunately, the cpcs were always out of my area of interest, so
> I'm flying a bit more blind than I'd like here.
> 
> Brian
> 
> --
>  Brian W. Barrett
>  Scalable System Software Group
>  Sandia National Laboratories
> 
> 
> 
> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
