Sorry for the delay in replying; I kept starting to look into this and
then getting distracted by shiny objects. :-(
OMPI v1.3 actually has a fairly sophisticated TCP address/network
matching algorithm. The hostname resolution shouldn't really be the
issue; OMPI directly queries the kernel IP interfaces on each node
where it runs and publishes that to all other MPI processes. Then the
TCP network matching algorithm is used to select the "best" pairs to
connect to.
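Roughly speaking, the matching prefers address pairs that land on the
same subnet. Here's a toy sketch of that idea (addresses taken from your
diagram, /24 masks assumed; the real logic lives in the tcp BTL and is
considerably more involved than this):

```python
import ipaddress

def best_pair(local_addrs, remote_addrs):
    """Return the first (local, remote) pair sharing a subnet.

    A toy stand-in for the tcp BTL's matching logic, which also
    weighs interface quality -- this is only the subnet idea.
    """
    for loc in local_addrs:
        for rem in remote_addrs:
            l, r = ipaddress.ip_interface(loc), ipaddress.ip_interface(rem)
            if l.network == r.network:
                return str(l.ip), str(r.ip)
    return None  # no common subnet found

# leela's labelled address vs. fry's two addresses (masks assumed /24):
pair = best_pair(["192.168.4.2/24"],
                 ["192.168.1.2/24", "192.168.4.1/24"])
print(pair)  # -> ('192.168.4.2', '192.168.4.1')
```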
Per your diagram:
192.168.1.1 192.168.1.2
hubert ------------------------ fry
| \ / | 192.168.4.1
| \ / |
| \ / |
| \ / |
| / \ |
| / \ |
| / \ |
| / \ | 192.168.4.2
hermes ------------------------ leela
I assume that the netmasks are all class C, right?
If so, I might ask you to run some diagnostic OMPI builds so that we
can see what the matching algorithm is doing on your machine...
On Apr 17, 2009, at 8:45 PM, Micha Feigin wrote:
I am having problems running openmpi 1.3 on my cluster and I was
wondering if anyone else is seeing this problem and/or can give hints
on how to solve it.
As far as I understand the error, mpiexec resolves host names on the
master node it is run on instead of on each host separately. This works
in an environment where each hostname resolves to the same address on
every host (a cluster connected via a switch) but fails where it
resolves to different addresses (ring/star setups, for example, where
each computer is connected directly to all/some of the others).
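To illustrate with hypothetical /etc/hosts entries (matching the
diagram below):

```
# On hubert, "fry" resolves via the 192.168.1.x link:
192.168.1.2   fry

# On leela, the same name resolves via the 192.168.4.x link:
192.168.4.1   fry
```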
I'm not 100% sure that this is the problem, as I'm seeing success in
one case where this should probably fail, but it is my best guess from
the error message.
Version 1.2.8 worked fine for the same simple program (a simple hello
world that just communicates the computer name for each process).
An example output:
mpiexec is run on the master node hubert and is set to run the
processes on two nodes, fry and leela. As understood from the error
messages, leela tries to connect to fry at address 192.168.1.2, which
is fry's address as seen from hubert but not from leela (where it is
192.168.4.1).
This is a four-node cluster, all interconnected:
192.168.1.1 192.168.1.2
hubert ------------------------ fry
| \ / | 192.168.4.1
| \ / |
| \ / |
| \ / |
| / \ |
| / \ |
| / \ |
| / \ | 192.168.4.2
hermes ------------------------ leela
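A quick check with Python's ipaddress module (the /24 masks are my
assumption) illustrates why leela's kernel would report "Network is
unreachable" in the output below:

```python
import ipaddress

# The only leela-side subnet labelled in the diagram is the
# fry<->leela link; the /24 mask is an assumption.
leela_nets = [ipaddress.ip_network("192.168.4.0/24")]

# fry's address as resolved on hubert, which leela then tries to use:
target = ipaddress.ip_address("192.168.1.2")

# leela has no interface (and hence no route) on 192.168.1.0/24,
# so the connect() fails with ENETUNREACH.
reachable = any(target in net for net in leela_nets)
print(reachable)  # -> False
```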
=================================================================
mpiexec -np 8 -H fry,leela test_mpi
Hello MPI from the server process of 8 on fry!
[[36620,1],1][../../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:589:mca_btl_tcp_endpoint_start_connect] from leela to: fry Unable to connect to the peer 192.168.1.2 on port 154: Network is unreachable
[[36620,1],3][../../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:589:mca_btl_tcp_endpoint_start_connect] from leela to: fry Unable to connect to the peer 192.168.1.2 on port 154: Network is unreachable
[[36620,1],7][../../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:589:mca_btl_tcp_endpoint_start_connect] from leela to: fry Unable to connect to the peer 192.168.1.2 on port 154: Network is unreachable
[leela:4436] *** An error occurred in MPI_Send
[leela:4436] *** on communicator MPI_COMM_WORLD
[leela:4436] *** MPI_ERR_INTERN: internal error
[leela:4436] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[[36620,1],5][../../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:589:mca_btl_tcp_endpoint_start_connect] from leela to: fry Unable to connect to the peer 192.168.1.2 on port 154: Network is unreachable
--------------------------------------------------------------------------
mpiexec has exited due to process rank 1 with PID 4433 on
node leela exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpiexec (as reported here).
--------------------------------------------------------------------------
[hubert:11312] 3 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[hubert:11312] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
=================================================================
This seems to be a directional issue: running the program with
-H fry,leela fails whereas -H leela,fry works. The same behaviour holds
for all scenarios except those that include the master node (hubert),
where the name resolves to the external IP (from an external DNS)
instead of the internal IP (from the hosts file). Thus one direction
fails (there is no external connection at the moment for any node but
the master) and the other causes a lockup.
I hope the explanation is not too convoluted.
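For reference, Open MPI's TCP BTL can be restricted to particular
interfaces with the btl_tcp_if_include MCA parameter, which may help
narrow things down; the interface name below is hypothetical, and with
a fully meshed layout like this a single interface may not cover every
peer pair:

```
mpiexec --mca btl_tcp_if_include eth1 -np 8 -H fry,leela test_mpi
```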
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
--
Jeff Squyres
Cisco Systems