Hi,

I have stumbled upon a similar issue, so I wonder whether these might be related. On one of our systems I get the following error message with both Open MPI 1.8.8 and 1.10.4:

$ mpirun -debug-daemons --mca btl tcp,self --mca mca_base_verbose 100 --mca btl_base_verbose 100 ls

[...]
[compute-1-1.local:07302] mca: base: close: unloading component direct
[compute-1-1.local:07302] mca: base: close: unloading component radix
[compute-1-1.local:07302] mca: base: close: unloading component debruijn
[compute-1-1.local:07302] orte_routed_base_select: initializing selected component binomial
[compute-1-2.local:13744] [[63041,0],2]: parent 0 num_children 0
Daemon [[63041,0],2] checking in as pid 13744 on host c1-2
[compute-1-2.local:13744] [[63041,0],2] orted: up and running - waiting for commands!
[compute-1-2.local:13744] [[63041,0],2] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)
[compute-1-2.local:13744] mca: base: close: unloading component binomial
[compute-1-1.local:07302] [[63041,0],1]: parent 0 num_children 0
Daemon [[63041,0],1] checking in as pid 7302 on host c1-1
[compute-1-1.local:07302] [[63041,0],1] orted: up and running - waiting for commands!
[compute-1-1.local:07302] [[63041,0],1] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)
[compute-1-1.local:07302] mca: base: close: unloading component binomial
srun: error: c1-1: task 0: Exited with exit code 1
srun: Terminating job step 4538.1
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: c1-2: task 1: Exited with exit code 1


I have also tested version 2.0.1; that one works without problems.

In my case the problem appears on one system with SLURM versions 15.08.8 and 15.08.12. On another system running 15.08.8 everything works fine, so I guess it is not about the SLURM version, but rather the system / network configuration?

Following that thought, I also noticed this thread:

http://users.open-mpi.narkive.com/PwJpWXLm/ompi-users-tcp-peer-send-blocking-send-to-socket-9-failed-broken-pipe-32-on-openvz-containers

As Jeff suggested there, I tried running with --mca btl_tcp_if_include em1 --mca oob_tcp_if_include em1, but got the same error.
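
For reference, a rough sketch of that invocation, combined with the debug flags from the command above (as far as I know a CIDR subnet can be given instead of an interface name if the naming differs across nodes):

# restrict both the OOB (daemon) and BTL (MPI) traffic to em1
$ mpirun -debug-daemons --mca btl tcp,self \
         --mca btl_tcp_if_include em1 \
         --mca oob_tcp_if_include em1 \
         --mca btl_base_verbose 100 ls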

Could these problems be related to interface naming or the lack of InfiniBand? Or to the fact that the front-end node, from which I execute mpirun, has a different network configuration? The system on which things don't work has only these TCP network interfaces:

em1, lo (the frontend has em1, em4 for the local compute network, and lo)

while the cluster on which Open MPI does work uses InfiniBand and has the following TCP interfaces:

eth0, eth1, ib0, lo
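
In case it is relevant, a quick sketch of how I compare the interfaces on the two sides (the hostname is just an example); as the ORTE error message below points out, mpirun and the remote daemons need at least one common, routable network between them:

$ ip -o -4 addr show                    # interfaces and addresses on the frontend
$ ssh compute-1-1 ip -o -4 addr show    # same check on a compute node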

I would appreciate any hints.

Thanks!

Marcin


On 04/01/2016 04:16 PM, Jeff Squyres (jsquyres) wrote:
Ralph --

What's the state of PMI integration with SLURM in the v1.10.x series?  (I 
haven't kept up with SLURM's recent releases to know if something broke between 
existing Open MPI releases and their new releases...?)



On Mar 31, 2016, at 4:24 AM, Tommi T <tommi_...@yahoo.com> wrote:

Hi,

stack:
el6.7, Mellanox OFED 3.1 (IB FDR) and SLURM 15.08.9 (without *.la libs).

problem:
Open MPI 1.10.x built with PMI support does not work when using the sbatch/salloc + mpirun combination; srun ompi_mpi_app works fine.

The older 1.8.x version works fine under the same salloc session.
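
To be clear, a sketch of the two launch modes being compared (node and task counts are just examples):

$ salloc -N 2 -n 2        # allocate nodes; then, inside the allocation:
$ mpirun ./ompi_mpi_app   # 1.10.x: the orteds die with tcp_peer_send_blocking
$ srun ./ompi_mpi_app     # direct launch through PMI: works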

./configure --with-slurm --with-verbs --with-hwloc=internal --with-pmi 
--with-cuda=/appl/opt/cuda/7.5/ --with-pic --enable-shared 
--enable-mpi-thread-multiple --enable-contrib-no-build=vt


I tried 1.10.3a from git also.


mpirun -debug-daemons ./1103aompitest
Daemon [[44437,0],1] checking in as pid 40979 on host g59
Daemon [[44437,0],2] checking in as pid 23566 on host g60
[g59:40979] [[44437,0],1] orted: up and running - waiting for commands!
[g60:23566] [[44437,0],2] orted: up and running - waiting for commands!
[g59:40979] [[44437,0],1] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)
[g59:40979] [[44437,0],1]:errmgr_default_orted.c(260) updating exit status to 1
[g60:23566] [[44437,0],2] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)
[g60:23566] [[44437,0],2]:errmgr_default_orted.c(260) updating exit status to 1
srun: error: g59: task 0: Exited with exit code 1
srun: Terminating job step 8922923.1
srun: Job step aborted: Waiting up to 12 seconds for job step to finish.
srun: error: g60: task 1: Exited with exit code 1
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------
[login2:48425] [[44437,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_HALT_VM_CMD
[login2:48425] [[44437,0],0] orted_cmd: received halt_vm cmd


[GPU-Env mpi]$ srun ./1103aompitest
g59: Before MPI_INIT
g59: After MPI_INIT
Hello world! I'm 0 of 2 on g59
g60: Before MPI_INIT
g60: After MPI_INIT
Hello world! I'm 1 of 2 on g60

ompi_info --parsable | grep pmi

mca:db:pmi:version:mca:2.0.0
mca:db:pmi:version:api:1.0.0
mca:db:pmi:version:component:1.10.3
mca:ess:pmi:version:mca:2.0.0
mca:ess:pmi:version:api:3.0.0
mca:ess:pmi:version:component:1.10.3
mca:grpcomm:pmi:version:mca:2.0.0
mca:grpcomm:pmi:version:api:2.0.0
mca:grpcomm:pmi:version:component:1.10.3
mca:pubsub:pmi:version:mca:2.0.0
mca:pubsub:pmi:version:api:2.0.0
mca:pubsub:pmi:version:component:1.10.3


module swap openmpi openmpi/1.8.6


[GPU-Env mpi]$ mpirun -debug-daemons ./ompigcc184
Daemon [[810,0],2] checking in as pid 55443 on host g60
Daemon [[810,0],1] checking in as pid 73091 on host g59
[g60:55443] [[810,0],2] orted: up and running - waiting for commands!
[g59:73091] [[810,0],1] orted: up and running - waiting for commands!
[login2:05014] [[810,0],0] orted_cmd: received add_local_procs
[g59:73091] [[810,0],1] orted_cmd: received add_local_procs
[g60:55443] [[810,0],2] orted_cmd: received add_local_procs
g60: Before MPI_INIT
g59: Before MPI_INIT
[g60:55443] [[810,0],2] orted_recv: received sync+nidmap from local proc [[810,1],1]
[g59:73091] [[810,0],1] orted_recv: received sync+nidmap from local proc [[810,1],0]
MPIR_being_debugged = 0
MPIR_debug_state = 1
MPIR_partial_attach_ok = 1
MPIR_i_am_starter = 0
MPIR_forward_output = 0
MPIR_proctable_size = 2
MPIR_proctable:
(i, host, exe, pid) = (0, g59, ompigcc184, 73096)
(i, host, exe, pid) = (1, g60, ompigcc184, 55448)
MPIR_executable_path: NULL
MPIR_server_arguments: NULL
[login2:05014] [[810,0],0] orted_cmd: received message_local_procs
[g59:73091] [[810,0],1] orted_cmd: received message_local_procs
[g60:55443] [[810,0],2] orted_cmd: received message_local_procs
[taito-login2.csc.fi:05014] [[810,0],0] orted_cmd: received message_local_procs
[g59:73091] [[810,0],1] orted_cmd: received message_local_procs
[g60:55443] [[810,0],2] orted_cmd: received message_local_procs
g59: After MPI_INIT
Hello world! I'm 0 of 2 on g59
g60: After MPI_INIT
Hello world! I'm 1 of 2 on g60
[login2:5014] [[810,0],0] orted_cmd: received message_local_procs
[g60:55443] [[810,0],2] orted_cmd: received message_local_procs
[g59:73091] [[810,0],1] orted_cmd: received message_local_procs
[g59:73091] [[810,0],1] orted_recv: received sync from local proc [[810,1],0]
[g60:55443] [[810,0],2] orted_recv: received sync from local proc [[810,1],1]
[login2:05014] [[810,0],0] orted_cmd: received exit cmd
[g60:55443] [[810,0],2] orted_cmd: received exit cmd
[g59:73091] [[810,0],1] orted_cmd: received exit cmd
[g60:55443] [[810,0],2] orted_cmd: all routes and children gone - exiting
[g59:73091] [[810,0],1] orted_cmd: all routes and children gone - exiting


[GPU-Env mpi]$ ompi_info -parsable |grep pmi
mca:db:pmi:version:mca:2.0
mca:db:pmi:version:api:1.0
mca:db:pmi:version:component:1.8.6
mca:ess:pmi:version:mca:2.0
mca:ess:pmi:version:api:3.0
mca:ess:pmi:version:component:1.8.6
mca:grpcomm:pmi:version:mca:2.0
mca:grpcomm:pmi:version:api:2.0
mca:grpcomm:pmi:version:component:1.8.6
mca:pubsub:pmi:version:mca:2.0
mca:pubsub:pmi:version:api:2.0
mca:pubsub:pmi:version:component:1.8.6

