On 11/14/2013 12:16 PM, Jeff Squyres (jsquyres) wrote:
On Nov 14, 2013, at 1:03 PM, Ralph Castain <r...@open-mpi.org> wrote:

1) What the status of UDCM is (does it work reliably, does it support
XRC, etc.)
Seems to be working okay on the IB systems at LANL and IU. Don't know about XRC - I seem 
to recall the answer is "no".
FWIW, I recall that when Cisco was testing UDCM (a long time ago -- before we 
threw away our IB gear...), we found bugs in UDCM that only showed up with 
really large numbers of MTT tests running UDCM (i.e., 10K+ tests a night, 
especially with lots of UDCM-based jobs running concurrently on the same 
cluster).  These types of bugs didn't show up in casual testing.

Has that happened with the new/fixed UDCM?  Cisco is no longer in a position to 
test this.

2) What's the difference between CPCs and OFACM, and what are our plans
w.r.t. 1.7 there?
Pasha created ofacm because some of the collective components now need to forge 
connections themselves. He put that code in common/ofacm to meet those needs, with the 
intention of someday replacing the openib CPCs with the new common code. 
However, this was stalled by the iWARP issue, and so it fell off the table.
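Roughly speaking (hypothetical names here, not the actual common/ofacm types), the idea 
is just a shared connection-manager interface that both the openib BTL and the collective 
components can drive, instead of each carrying its own CPC:

    /* Sketch only -- illustrative, not the real common/ofacm API. */
    struct conn_manager {
        const char *name;                      /* e.g. "oob", "rdmacm", "udcm" */
        int (*query)(void *device);            /* can this CM drive this device? */
        int (*start_connect)(void *endpoint);  /* kick off connection setup */
        int (*finalize)(void);
    };

    /* Any component (openib BTL or a collective) would ask the common code
     * for a connection instead of going through an openib-private CPC. */
    static inline int conn_request(struct conn_manager *cm, void *endpoint)
    {
        return cm->start_connect(endpoint);
    }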

Perhaps if Pasha or somebody else proficient in the OMPI code could help out, then the iWARP CPC could be moved. Without help from OMPI developers, it's going to take me a very long time...


We now have two duplicate ways of doing the same thing, but with code in two 
different places. :-(
FWIW, the iWARP vendors have repeatedly been warned that ofacm is going to take 
over, and unless they supply patches, iWarp will stop working in Open MPI.  I 
know for a fact that they are very aware of this.

So my $0.02 is that ofacm should take over -- let's get rid of the CPCs and have 
openib use ofacm.  The iWARP folks can play catch-up if/when they want to.

Of course, I'm not in this part of the code base any more, so it's not really 
my call -- just my $0.02...


Can't we leave the openib rdmacm CPC code as-is until we can get the rdmacm CPC moved into OFACM? What is the harm with that, exactly? I mean, if no iWARP devices support these accelerated MPI collectives, then leave the rdmacm CPC in the openib BTL so we can at least support iWARP via the openib BTL...
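For what it's worth (assuming CPC selection keeps working the way it does today), iWARP users can already pin the openib BTL to the rdmacm CPC with the usual MCA parameters, so keeping that code path around costs nothing, e.g.:

    # force the rdmacm CPC for the openib BTL (what iWARP NICs need);
    # ./my_mpi_app is just a placeholder application
    mpirun --mca btl openib,self,sm \
           --mca btl_openib_cpc_include rdmacm \
           -np 4 ./my_mpi_app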

Steve.
