On 11/14/13 11:16 AM, "Jeff Squyres (jsquyres)" <jsquy...@cisco.com> wrote:

>On Nov 14, 2013, at 1:03 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
>>> 1) What the status of UDCM is (does it work reliably, does it support
>>> XRC, etc.)
>> 
>> Seems to be working okay on the IB systems at LANL and IU. Don't know
>>about XRC - I seem to recall the answer is "no"
>
>FWIW, I recall that when Cisco was testing UDCM (a long time ago --
>before we threw away our IB gear...), we found bugs in UDCM that only
>showed up with really large numbers of MTT tests running UDCM (i.e., 10K+
>tests a night, especially with lots of UDCM-based jobs running
>concurrently on the same cluster).  These types of bugs didn't show up in
>casual testing.
>
>Has that happened with the new/fixed UDCM?  Cisco is no longer in a
>position to test this.

Neither are we at Sandia, unfortunately.  I only have 16 nodes for nightly
testing, and only 8 of those are always running Linux, so that doesn't
help much on the stress test.

>>> 2) What's the difference between CPCs and OFACM, and what are our plans
>>> w.r.t. 1.7 there?
>> 
>> Pasha created ofacm because some of the collective components now need
>>to forge connections. So he created the common/ofacm code to meet those
>>needs, with the intention of someday replacing the openib CPCs with the
>>new common code. However, this was stalled by the iWARP issue, and so it
>>fell off the table.
>> 
>> We now have two duplicate ways of doing the same thing, but with code
>>in two different places. :-(
>
>FWIW, the iWARP vendors have repeatedly been warned that ofacm is going
>to take over, and unless they supply patches, iWARP will stop working in
>Open MPI.  I know for a fact that they are very aware of this.
>
>So my $0.02 is that ofacm should take over -- let's get rid of CPC and
>have openib use the ofacm.  The iWARP folks can play catch up if/when
>they want to.
>
>Of course, I'm not in this part of the code base any more, so it's not
>really my call -- just my $0.02...

Of course, that doesn't help with the core issue; we can't have a
regression w.r.t. XRC support between 1.7.3 and 1.7.4.  But I agree, I'm
fine with only fixing this in one place.

Brian

--
  Brian W. Barrett
  Scalable System Software Group
  Sandia National Laboratories



