Hi Reuti

See below, please.

On 11/13/2014 07:19 AM, Reuti wrote:
Gus,

On 13.11.2014 at 02:59, Gus Correa wrote:

On 11/12/2014 05:45 PM, Reuti wrote:
On 12.11.2014 at 17:27, Reuti wrote:

On 11.11.2014 at 02:25, Ralph Castain wrote:

Another thing you can do is (a) ensure you built with --enable-debug,
and then (b) run it with -mca oob_base_verbose 100
(without the tcp_if_include option) so we can watch
the connection handshake and see what it is doing.
The --hetero-nodes option will have no effect here and can be ignored.

Done. It really tries to connect to the outside
interface of the headnode. But firewall or not:
the nodes have no clue how to reach 137.248.0.0 -
they have no gateway to this network at all.

I have to revert this.
They think that there is a gateway although there isn't one.
When I remove the gateway entry from the
routing table by hand, it starts up instantly too.

While I can do this on my own cluster, I still have the
30-second delay on a cluster where I'm not root,
though this may be because of the firewall there.
The gateway on this cluster does indeed lead
to the outside world.

Personally, I find this behavior of using all interfaces
a little too aggressive. If you don't check this carefully
beforehand and start a long-running application, you might
not even notice the delay during the startup.

-- Reuti


Hi Reuti

You could use the mca parameter file
(say, $prefix/etc/openmpi-mca-params.conf) to configure cluster-wide
which oob (and btl) interfaces are to be used.
The users can still override your choices if they want.

Just put a line like this in openmpi-mca-params.conf :
oob_tcp_if_include=192.168.154.0/26

(and similar for btl_tcp_if_include, btl_openib_if_include).

Get a full list from "ompi_info --all | grep if_include".
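For instance, a minimal cluster-wide file could look like the sketch
below (the subnet and the HCA name are just examples for illustration,
adjust them to whatever your cluster actually uses):

  # $prefix/etc/openmpi-mca-params.conf  (cluster-wide defaults)
  # restrict the out-of-band (startup) traffic to the internal subnet
  oob_tcp_if_include = 192.168.154.0/26
  # restrict MPI TCP traffic to the same subnet
  btl_tcp_if_include = 192.168.154.0/26
  # if you use InfiniBand, you can pin the HCA as well (name is an example)
  btl_openib_if_include = mlx4_0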

See these FAQ entries:

http://www.open-mpi.org/faq/?category=tcp#tcp-selection
http://www.open-mpi.org/faq/?category=tuning#setting-mca-params

Compute nodes tend to be multi-homed, so what criterion would OMPI use
to select one interface among many,

My compute nodes have two interfaces:
one for MPI (and the low ssh/SGE traffic to start processes somewhere)
and one for NFS to transfer files from/to the file server.
So: Open MPI may use both interfaces without telling me anything about it?
How will it split the traffic? 50%/50%?

Honestly, I don't know.
My suggestion is to pick the interface you really want via mca parameters.
Not sure if there is a way to make OMPI report what it is doing in this regard when many interfaces are available.

Ralph, Jeff, OMPI experts:

If we write -mca btl_tcp_if_include eth0,eth1
will the interface order (eth0,eth1 as opposed to eth1,eth0) have any impact on how they are used by OMPI?
If we don't write anything, will OMPI somehow select and use the fastest?
Will it drop the slowest or the slower ones?
How will it split the traffic?


When there is a heavy file transfer on the NFS interface:
might it hurt Open MPI's communication or will it balance the usage on-the-fly?


I prefer to separate the two traffic lanes.
NFS has been by far a much more frequent source
of trouble than (Open)MPI.
Hence, I set up the mca parameter file to use the IB interface ("openib", "sm" - which became "vader" as of 1.8.X - and "self") for btl (I didn't bother to set it for oob), and mount NFS over Ethernet.
I tried NFS4 over RDMA and over IPoIB and the results weren't so good.
I didn't have much time to test/experiment (NFSv3, etc.), so I moved to production that way.
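Roughly, the relevant line in my openmpi-mca-params.conf looks like the
one below (reproduced from memory, so take it as a sketch rather than a
verbatim copy of my file):

  # $prefix/etc/openmpi-mca-params.conf
  # InfiniBand between nodes, shared memory within a node, self loopback
  btl = openib,sm,self
  # on 1.8.X, "vader" instead of (or in addition to) "sm":
  # btl = openib,vader,self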

When I prepare a machinefile with the names of the interfaces
(or get the names from SGE's PE_HOSTFILE), it should use just these (except for native IB),
and not look around for other paths to the other machine(s) (IMO).

Normally I don't handcraft the machine file, just use what the resource
manager gives (Torque in my case).
Some programs (say MPMD) require handcrafting (or using the mpiexec command line).

Based on your other emails to Ralph, I believe we are talking about two end members in the way the interfaces are picked:

1) The current OMPI approach: In principle use everything, and let you (and/or the user) re-configure it with mca parameters (in gazillions of
different ways, very flexible) to fit your computer configuration and
needs.

I do like this, because I guess it would be very difficult for OMPI to make a specific interface choice (and get it right) not knowing
the details of your network setup, name resolution, etc.
[Just see the NFS vs. MPI networks that you, I, and others have.]

Also, you can simply use the mca parameter configuration file to your
advantage, and fine-tune everything to your clusters.
Still, power users can override your settings if they need to, by using
their own .openmpi mca configuration files, or environment variables,
or mpiexec mca command line options.
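For instance (eth0 below is just a placeholder for whatever interface
the user actually wants):

  # environment variable: any mca parameter name prefixed with OMPI_MCA_
  export OMPI_MCA_btl_tcp_if_include=eth0

  # or directly on the command line (takes precedence over the files):
  mpiexec --mca btl_tcp_if_include eth0 -np 16 ./my_app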

2) What you seem to be proposing, which is to use whatever IP addresses resolve to the host names in the hostfile (handcrafted, given by the RM, or on the fly via the --host option), or the IPs themselves instead of names.
[I hope I understood you right and did not misrepresent what you said.]

I confess I don't see this as an advantage.
It is unlikely that your hostnames resolve to the right interface that you want to use for MPI (and actually your and my IB interfaces do *not* resolve to the target IP/interface; it is the Ethernet one that resolves this way, and you don't want to use Ethernet for MPI).


Therefore different interfaces have different names in my setup.
"node01" is just eth0 and different from "node01-nfs" for eth1.


not knowing beforehand what exists in a particular computer?
There would be a risk of making a bad choice.
The current approach gives you everything, and you
pick/select/restrict what you want to fit your needs,
with mca parameters (which can be set in several
ways and with various scopes).

I don't think this is bad.
However, I am biased about this.
I like and use the openmpi-mca-params.conf file
to set up sensible defaults.
At least I think they are sensible. :)

I see that this can be prepared for all users this way.
Whenever they use my installed version it will work -
maybe I'll have to investigate what to enter there on some other
clusters where I'm not root, but it can be done for sure.

If you need to support OMPI users on clusters where you don't
have root access,
you can either ask the sysadmin to put the mca configuration file in
the $OMPI/etc directory, or you can give each user a tailored .openmpi file to put in their home directory.
There are certainly other ways also.
See the FAQ:
http://www.open-mpi.org/faq/?category=tuning#setting-mca-params
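For example, each user could do something like this (the subnet is just
the one from the earlier example; adjust it to the cluster at hand):

  mkdir -p $HOME/.openmpi
  echo "oob_tcp_if_include = 192.168.154.0/26"  > $HOME/.openmpi/mca-params.conf
  echo "btl_tcp_if_include = 192.168.154.0/26" >> $HOME/.openmpi/mca-params.conf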

IMHO, this mca parameter flexibility is really one of the big upsides of
OMPI.

The downside may be that there are just gazillions of parameters,
and not nearly enough clarity (and documentation) on what one should do
to choose good values.
(I am struggling with the sm/vader parameters, parameters for collectives, etc.)
However, for the network setup I think it is not so hard;
it is more or less straightforward once you know the cluster network details.
Ralph and Jeff will kill me for complaining about documentation for the
gazillionth time ...



BUT: it may be a rare situation that a quantum chemistry group
has a sysadmin of their own taking care of the clusters and of
the well-behaved operation of the installed software, be it
applications or libraries. Often some PhD student in other groups
will get a side project: please install software XY for the group.
They are chemists and want to get the software running -
they are no Open MPI experts*.

We have atmosphere/ocean/climate science folks here,
but that is where the differences stop.
They are certainly not MPI experts (and not much interested in those details).
So, that is why providing a set of sensible defaults is a good thing.

Or perhaps, if needed, you can think of a set of defaults depending on the application, and distribute them via the MCA aggregate parameter files (same FAQ):
http://www.open-mpi.org/faq/?category=tuning#setting-mca-params
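If I remember the aggregate mechanism correctly (please double-check the
FAQ above; the parameter-set name below is made up), it goes roughly
like this:

  # application-specific parameter set, placed in e.g.:
  #   $prefix/share/openmpi/amca-param-sets/orca-defaults
  btl = openib,sm,self
  oob_tcp_if_include = 192.168.154.0/26

  # and selected at run time with:
  mpiexec -am orca-defaults -np 16 ./my_app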


They don't care for a tight integration or for using the correct interfaces, as
long as the application delivers the results in the end. For example: ORCA**.
It's necessary for the users of the software to install a shared-library build
of Open MPI in a specific version. I see in the ORCA*** forum that many
struggle to compile a shared-library version of Open MPI and to
have access to it during execution, i.e. how to set LD_LIBRARY_PATH
so that it's known on the slaves. The cluster admins are in another
department and sometimes refuse to make any special arrangements for a
single group.

Fortunately, I don't have these political boundaries in the clusters I oversee. We provide as much as we can in terms of builds of OMPI and other libraries, and make them available through environment modules.
I guess this is pretty much what you and many others do,
along with Wikis or README files on how to do basic stuff,
or compile and run the most popular programs.

However, in the university-wide cluster the situation is similar to what
you describe, and there is a lot of user disappointment with that.
In such situations, I've seen people installing OpenMPI in their home directory, etc. You could try to do this for the users, but that is a bit of an insane job, too much work spread across many users.
However,
quite frankly, I think this is a cluster sysadmin duty; too bad if the "other department" fellows don't consider that a service to the users.

And as ORCA calls `mpiexec` several
times during one job, the delay could occur several times.


If it is a cluster that you don't have control of, give the users a
.openmpi/mca-params.conf file.
That may help.

On some other clusters that we have access to, the admins provide
Open MPI installations accessible via `modules`. But often not for the
particular combination of Open MPI version and compiler type/version that is needed.

I understand the problem, but if you cannot convince the sysadmins to build OMPI and the other libraries needed, the only way to solve it is to teach the users to install OMPI in their own home directories (assuming home directories are mounted cluster-wide).
Or do it for them.
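Something along these lines usually does it (the version and paths below
are just examples; ORCA apparently wants 1.6.5, per your footnote):

  # build a shared-library Open MPI under the user's home directory
  tar xzf openmpi-1.6.5.tar.gz && cd openmpi-1.6.5
  ./configure --prefix=$HOME/sw/openmpi-1.6.5 \
              --enable-shared --enable-mpirun-prefix-by-default
  make -j4 && make install

  # then in ~/.bashrc (or a private module file):
  export PATH=$HOME/sw/openmpi-1.6.5/bin:$PATH
  export LD_LIBRARY_PATH=$HOME/sw/openmpi-1.6.5/lib:$LD_LIBRARY_PATH

The --enable-mpirun-prefix-by-default configure flag (or, alternatively,
running with "mpiexec -x LD_LIBRARY_PATH") is what usually takes care of
the "library not found on the slaves" problem you mention from the ORCA
forum.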

If a software vendor suggests using compiler X in version Y, it's best to
follow that approach, as it will generate fewer issues that might need to be
investigated - e.g. numerical variations, since different compilers optimize in
different ways. Hence you end up compiling the necessary Open MPI on your own
again, and again setting sensible defaults as you lay out above.


We have to build many combinations with different compilers also.
Some programs only compile with Intel, others with PGI, a few are kind
enough to compile with gfortran, and it is not only OMPI, but other libraries as well (HDF5, NetCDF, etc.). One sticking point is that Fortran 90 (and later) modules are not compatible across compilers.
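Just to illustrate (prefixes and version numbers are made up, and every
site names things its own way):

  # Open MPI built with the Intel compilers
  ./configure CC=icc CXX=icpc F77=ifort FC=ifort \
      --prefix=/sw/openmpi/1.8.3-intel
  make -j4 && make install

  # the same Open MPI version built with the GNU compilers
  ./configure CC=gcc CXX=g++ F77=gfortran FC=gfortran \
      --prefix=/sw/openmpi/1.8.3-gnu
  make -j4 && make install

We then expose each build through its own environment module.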

Cheers,
Gus Correa

Continued in 2nd email...

-- Reuti

*) Sure, there are exceptions and experts too -
I don't intend to offend anyone by this statement.
But I speak for the groups of QC I have had contact
with in the last couple of years.

**) http://www.cec.mpg.de/forum/portal.php

***) The current ORCA needs 1.6.5, but that may change at some point in the future.



Cheers,
Gus Correa


It does so regardless of whether the internal or external name of the headnode
is given in the machinefile - I hit ^C then.
I attached the output of Open MPI 1.8.1 for this setup too.

-- Reuti

<openmpi1.8.3.txt><openmpi1.8.1.txt>
