On Jan 8, 2008, at 7:45 AM, Lenny Verkhovsky wrote:

>> Hence, if HAVE_DECL_AF_INET_SDP==1 and using AF_INET_SDP fails to that
>> peer, it might be desirable to try to fail over to using
>> AF_INET_something_else.  I'm still technically on vacation :-), so I
>> didn't look *too* closely at your patch, but I think you're doing that
>> (failing over if AF_INET_SDP doesn't work because of EAFNOSUPPORT),
>> which is good.
>
> This is actually not implemented yet.  Supporting failover would
> require opening AF_INET sockets in addition to SDP sockets, which can
> be a problem on large clusters.

What I meant was: try to open an SDP socket.  If that fails because SDP is not supported / available for that peer, then open a regular socket.  So you should still always have only 1 socket open to a peer (not 2).

> If one of the machines does not support SDP, the user will get an error.

Well, that's one way to go, but it's certainly less friendly. It means that the entire MPI job has to support SDP -- including mpirun. What about clusters that do not have IB on the head node?

>> Perhaps a more general approach would be to [perhaps additionally]
>> provide an MCA param to allow the user to specify the AF_* value?
>> (AF_INET_SDP is a standardized value, right?  I.e., will it be the
>> same on all Linux variants [and someday Solaris]?)
>
> I didn't find any standard for it; the value seems to have been chosen
> "randomly" -- it was originally 26 and was changed to 27 because of a
> conflict with the kernel's defines.

This might make an even stronger case for having an MCA param for it -- if the AF_INET_SDP value is so broken that it's effectively random, it may be necessary to override it on some platforms (especially in light of binary OMPI and OFED distributions that may not match).
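For example (the parameter name `btl_tcp_sdp_af` is hypothetical; whatever name is chosen, MCA params can be set on the mpirun command line or via an `OMPI_MCA_*` environment variable):

```sh
# Hypothetical MCA parameter overriding the AF_INET_SDP value:
mpirun --mca btl_tcp_sdp_af 27 -np 16 ./a.out

# Equivalent environment-variable form (picked up by remote orteds
# when the resource manager propagates the environment):
export OMPI_MCA_btl_tcp_sdp_af=27
```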

>> Patrick's got a good point: is there a reason not to do this?
>> (LD_PRELOAD and the like)  Is it problematic with the remote orted's?
>
> Yes, it's problematic with remote orteds, and it's not as transparent
> as you might think.  Since we can't pass environment variables to the
> orteds at runtime,

I think this depends on your environment. If you're not using rsh (which you shouldn't be for a large cluster, which is where SDP would matter most, right?), the resource manager typically copies the environment out to the cluster nodes. So an LD_PRELOAD value should be set for the orteds as well.

I agree that it's problematic for rsh, but that might also be solvable (with some limits; there's only so many characters that we can pass on the command line -- we did investigate having a wrapper to the orted at one point to accept environment variables and then launch the orted, but this was so problematic / klunky that we abandoned the idea).

> we must preload the SDP library in each remote environment (i.e., in
> .bashrc).  This will cause all applications to use SDP instead of
> AF_INET, which means you can't choose a specific protocol for a
> specific application: either you use SDP or AF_INET for everything.
> SDP can also be configured via /usr/local/ofed/etc/libsdp.conf, but a
> regular user usually has no access to that file.
> (http://www.cisco.com/univercd/cc/td/doc/product/svbu/ofed/ofed_1_1/ofed_ug/sdp.htm#wp952927)
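For reference, a libsdp.conf rule is roughly of this form (syntax per the OFED libsdp documentation; the program name and port here are purely illustrative):

```
# Use SDP for listening sockets of "myapp" on port 5001 only;
# everything else stays on plain TCP.
use sdp server myapp *:5001
use tcp both * *:*
```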

>> Andrew's got a point here, too -- accelerating the TCP BTL with
>> SDP seems kinda pointless.  I'm guessing that you did it because it
>> was just about the same work as was done in the TCP OOB (for which we
>> have no corresponding verbs interface).  Is that right?
>
> Indeed.  But it also seems that SDP has lower overhead than verbs in
> some cases.

Are you referring to the fact that the avail(%) column is lower for verbs than SDP/IPoIB? That seems like a pretty weird metric for such small message counts. What exactly does 77.5% of 0 bytes mean?

My $0.02 is that the other columns are more compelling.  :-)

> Tests with Sandia's overlapping benchmark
> http://www.cs.sandia.gov/smb/overhead.html#mozTocId316713
>
> VERBS results
> msgsize  iterations  iter_t    work_t    overhead  base_t    avail(%)
> 0        1000        16.892    15.309    1.583     7.029     77.5
> 2        1000        16.852    15.332    1.520     7.144     78.7
> 4        1000        16.932    15.312    1.620     7.128     77.3
> 8        1000        16.985    15.319    1.666     7.182     76.8
> 16       1000        16.886    15.297    1.589     7.219     78.0
> 32       1000        16.988    15.311    1.677     7.251     76.9
> 64       1000        16.944    15.299    1.645     7.457     77.9
>
> SDP results
> msgsize  iterations  iter_t    work_t    overhead  base_t    avail(%)
> 0        1000        134.902   128.089   6.813     54.691    87.5
> 2        1000        135.064   128.196   6.868     55.283    87.6
> 4        1000        135.031   128.356   6.675     55.039    87.9
> 8        1000        130.460   125.908   4.552     52.010    91.2
> 16       1000        135.432   128.694   6.738     55.615    87.9
> 32       1000        135.228   128.494   6.734     55.627    87.9
> 64       1000        135.470   128.540   6.930     56.583    87.8
>
> IPoIB results
> msgsize  iterations  iter_t    work_t    overhead  base_t    avail(%)
> 0        1000        252.953   247.053   5.900     119.977   95.1
> 2        1000        253.336   247.285   6.051     121.573   95.0
> 4        1000        254.147   247.041   7.106     122.110   94.2
> 8        1000        254.613   248.011   6.602     121.840   94.6
> 16       1000        255.662   247.952   7.710     124.738   93.8
> 32       1000        255.569   248.057   7.512     127.095   94.1
> 64       1000        255.867   248.308   7.559     132.858   94.3


--
Jeff Squyres
Cisco Systems
