On Jan 8, 2008, at 7:45 AM, Lenny Verkhovsky wrote:
Hence, if HAVE_DECL_AF_INET_SDP==1 and using AF_INET_SDP fails to that
peer, it might be desirable to try to fail over to using
AF_INET_something_else. I'm still technically on vacation :-), so I
didn't look *too* closely at your patch, but I think you're doing that
(failing over if AF_INET_SDP doesn't work because of EAFNOSUPPORT),
which is good.
This is actually not implemented yet.
Supporting failover requires opening AF_INET sockets in addition to
SDP sockets; this can cause a problem in large clusters.
What I meant was try to open an SDP socket. If it fails because SDP
is not supported / available to that peer, then open a regular
socket. So you should still always have only 1 socket open to a peer
(not 2).
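The "try SDP first, then open a regular socket" idea can be sketched as follows. This is only an illustration in Python, not the code in the patch: the value 27 for AF_INET_SDP is the one mentioned in this thread and is an assumption, and open_stream_socket and its factory argument are hypothetical names introduced here.

```python
import errno
import socket

# Assumption: 27 is the AF_INET_SDP value mentioned in this thread; it is
# not a standardized constant and may differ between platforms.
AF_INET_SDP = 27


def open_stream_socket(preferred=AF_INET_SDP, fallback=socket.AF_INET,
                       factory=socket.socket):
    """Return (sock, family), trying `preferred` first, then `fallback`.

    Only one socket is ever returned, so there is still a single socket
    per peer. `factory` is a hypothetical hook so that the failover path
    can be exercised without SDP hardware.
    """
    try:
        return factory(preferred, socket.SOCK_STREAM), preferred
    except OSError as exc:
        # Fail over only when the address family itself is unsupported;
        # any other error is re-raised unchanged.
        if exc.errno != errno.EAFNOSUPPORT:
            raise
    return factory(fallback, socket.SOCK_STREAM), fallback
```

Note that this covers only a failure at socket creation time. Detecting that a particular peer does not support SDP would also need similar handling around the connect step, which is not shown here.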
If one of the machines does not support SDP, the user will get an error.
Well, that's one way to go, but it's certainly less friendly. It
means that the entire MPI job has to support SDP -- including mpirun.
What about clusters that do not have IB on the head node?
Perhaps a more general approach would be to [perhaps additionally]
provide an MCA param to allow the user to specify the AF_* value?
(AF_INET_SDP is a standardized value, right? I.e., will it be the
same on all Linux variants [and someday Solaris]?)
I didn't find any standard for it; it seems to have been selected
"randomly", since originally it was 26 and it was changed to 27 due to
a conflict with the kernel's defines.
This might make an even stronger case for having an MCA param for it
-- if the AF_INET_SDP value is so broken that it's effectively random,
it may be necessary to override it on some platforms (especially in
light of binary OMPI and OFED distributions that may not match).
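A minimal sketch of such an override, in Python for brevity. Open MPI does read MCA parameters from environment variables prefixed with OMPI_MCA_, but the parameter name used below is hypothetical, as are the helper name and the default of 27 (taken from the value mentioned in this thread).

```python
import os

# Assumption: 27 is the AF_INET_SDP value mentioned in this thread.
DEFAULT_AF_INET_SDP = 27

# Hypothetical parameter name, used for illustration only; it is not
# claimed to exist in Open MPI.
PARAM_ENV = "OMPI_MCA_oob_tcp_sdp_af"


def resolve_sdp_family(env=None):
    """Return the address family to use for SDP sockets.

    Uses the override from the environment when present, otherwise the
    built-in default.
    """
    env = os.environ if env is None else env
    raw = env.get(PARAM_ENV)
    if raw is None or raw.strip() == "":
        return DEFAULT_AF_INET_SDP
    value = int(raw, 0)  # base 0 accepts decimal or 0x-prefixed hex
    if value <= 0:
        raise ValueError("address family must be a positive integer")
    return value
```

With something like this, a site whose kernel headers still use the older value could set the parameter to 26 without rebuilding anything.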
Patrick's got a good point: is there a reason not to do this?
(LD_PRELOAD and the like) Is it problematic with the remote orted's?
Yes, it's problematic with remote orted's and it's not really as
transparent as you might think.
Since we can't pass environment variables to the orted's during
runtime
I think this depends on your environment. If you're not using rsh
(which you shouldn't be for a large cluster, which is where SDP would
matter most, right?), the resource manager typically copies the
environment out to the cluster nodes. So an LD_PRELOAD value should
be set for the orteds as well.
I agree that it's problematic for rsh, but that might also be solvable
(with some limits; there's only so many characters that we can pass on
the command line -- we did investigate having a wrapper to the orted
at one point to accept environment variables and then launch the
orted, but this was so problematic / klunky that we abandoned the idea).
we must preload the SDP library in each remote environment (i.e.
bashrc). This will cause all applications to use SDP instead of
AF_INET, which means you can't choose a specific protocol for a
specific application: either you are using SDP or AF_INET for all.
SDP can also be loaded with an appropriate
/usr/local/ofed/etc/libsdp.conf configuration, but an ordinary user
usually has no access to it.
(http://www.cisco.com/univercd/cc/td/doc/product/svbu/ofed/ofed_1_1/ofed_ug/sdp.htm#wp952927)
Andrew's got a point here, too -- accelerating the TCP BTL with
SDP seems kinda pointless. I'm guessing that you did it because it
was just about the same work as was done in the TCP OOB (for which we
have no corresponding verbs interface). Is that right?
Indeed. But it also seems that SDP has lower overhead than VERBS in
some cases.
Are you referring to the fact that the avail(%) column is lower for
verbs than SDP/IPoIB? That seems like a pretty weird metric for such
small message counts. What exactly does 77.5% of 0 bytes mean?
My $0.02 is that the other columns are more compelling. :-)
Tests with Sandia's overlapping benchmark
http://www.cs.sandia.gov/smb/overhead.html#mozTocId316713
VERBS results
msgsize  iterations   iter_t   work_t  overhead   base_t  avail(%)
      0        1000   16.892   15.309     1.583    7.029      77.5
      2        1000   16.852   15.332     1.520    7.144      78.7
      4        1000   16.932   15.312     1.620    7.128      77.3
      8        1000   16.985   15.319     1.666    7.182      76.8
     16        1000   16.886   15.297     1.589    7.219      78.0
     32        1000   16.988   15.311     1.677    7.251      76.9
     64        1000   16.944   15.299     1.645    7.457      77.9
SDP results
msgsize  iterations   iter_t   work_t  overhead   base_t  avail(%)
      0        1000  134.902  128.089     6.813   54.691      87.5
      2        1000  135.064  128.196     6.868   55.283      87.6
      4        1000  135.031  128.356     6.675   55.039      87.9
      8        1000  130.460  125.908     4.552   52.010      91.2
     16        1000  135.432  128.694     6.738   55.615      87.9
     32        1000  135.228  128.494     6.734   55.627      87.9
     64        1000  135.470  128.540     6.930   56.583      87.8
IPoIB results
msgsize  iterations   iter_t   work_t  overhead   base_t  avail(%)
      0        1000  252.953  247.053     5.900  119.977      95.1
      2        1000  253.336  247.285     6.051  121.573      95.0
      4        1000  254.147  247.041     7.106  122.110      94.2
      8        1000  254.613  248.011     6.602  121.840      94.6
     16        1000  255.662  247.952     7.710  124.738      93.8
     32        1000  255.569  248.057     7.512  127.095      94.1
     64        1000  255.867  248.308     7.559  132.858      94.3
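
On the avail(%) question: the columns in the tables above appear to be related by overhead = iter_t - work_t and avail = 100 * (1 - overhead / base_t). That relationship is inferred from the numbers themselves, not taken from the benchmark's documentation, so treat it as an assumption. The sketch below recomputes the 0-byte rows under that assumption.

```python
def availability(iter_t, work_t, base_t):
    """Recompute avail(%) under the assumed relationship between columns."""
    overhead = iter_t - work_t
    return 100.0 * (1.0 - overhead / base_t)


# (iter_t, work_t, base_t) for the 0-byte rows of the tables above.
rows = {
    "VERBS": (16.892, 15.309, 7.029),
    "SDP": (134.902, 128.089, 54.691),
    "IPoIB": (252.953, 247.053, 119.977),
}

for name, (iter_t, work_t, base_t) in rows.items():
    overhead = iter_t - work_t
    avail = availability(iter_t, work_t, base_t)
    print(f"{name:6s} overhead={overhead:6.3f} base_t={base_t:8.3f} "
          f"avail={avail:5.1f}%")
```

If that reading is right, SDP scores a higher avail(%) than verbs even though its absolute overhead is roughly four times larger, because its base_t is roughly eight times larger. That is why the other columns look like the more meaningful comparison.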
--
Jeff Squyres
Cisco Systems