On 11/14/13 11:16 AM, "Jeff Squyres (jsquyres)" <jsquy...@cisco.com> wrote:
>On Nov 14, 2013, at 1:03 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
>>> 1) What the status of UDCM is (does it work reliably, does it support
>>> XRC, etc.)
>>
>> Seems to be working okay on the IB systems at LANL and IU. Don't know
>>about XRC - I seem to recall the answer is "no"
>
>FWIW, I recall that when Cisco was testing UDCM (a long time ago --
>before we threw away our IB gear...), we found bugs in UDCM that only
>showed up with really large numbers of MTT tests running UDCM (i.e., 10K+
>tests a night, especially with lots of UDCM-based jobs running
>concurrently on the same cluster).  These types of bugs didn't show up in
>casual testing.
>
>Has that happened with the new/fixed UDCM?  Cisco is no longer in a
>position to test this.

Neither are we at Sandia, unfortunately.  I only have 16 nodes for nightly
testing, and only 8 of those are always running Linux, so that doesn't
help much on the stress test.

>>> 2) What's the difference between CPCs and OFACM and what's our plans
>>> w.r.t 1.7 there?
>>
>> Pasha created ofacm because some of the collective components now need
>>to forge connections. So he created the common/ofacm code to meet those
>>needs, with the intention of someday replacing the openib cpc's with the
>>new common code. However, this was stalled by the iWarp issue, and so it
>>fell off the table.
>>
>> We now have two duplicate ways of doing the same thing, but with code
>>in two different places. :-(
>
>FWIW, the iWARP vendors have repeatedly been warned that ofacm is going
>to take over, and unless they supply patches, iWarp will stop working in
>Open MPI.  I know for a fact that they are very aware of this.
>
>So my $0.02 is that ofacm should take over -- let's get rid of CPC and
>have openib use the ofacm.  The iWarp folks can play catch up if/when
>they want to.
>
>Of course, I'm not in this part of the code base any more, so it's not
>really my call -- just my $0.02...

Of course, that doesn't help with the core issue; we can't have a
regression w.r.t XRC support between 1.7.3 and 1.7.4.  But I agree, I'm
fine with only fixing this in one place.

Brian

--
  Brian W. Barrett
  Scalable System Software Group
  Sandia National Laboratories