Hi, I'm trying to run an OpenMPI 1.6.5 job across a set of nodes, some with Mellanox IB cards and some with QLogic IB cards. The job fails with errors saying "At least one pair of MPI processes are unable to reach each other for MPI communications". As far as I can tell, all of the nodes are properly configured and can reach each other over both IP and non-IP connections. I've also discovered that even if I take the IB transport out of the picture entirely by restricting the BTLs to tcp and self ("--mca btl tcp,self"), I still hit the same failure. The test works fine if I confine it to hosts with identical IB cards. I'd appreciate some assistance in figuring out what I'm doing wrong.
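In case the test program itself matters: it's just a simple alltoall, and as the log below shows, the failure happens in MPI_Init before the collective is ever reached. This isn't the exact source of alltoall.mpi-1.6.5, but a stripped-down stand-in would look like:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size, i;
    int *sendbuf, *recvbuf;

    MPI_Init(&argc, &argv);               /* this is where the job dies */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* each rank contributes one int per peer */
    sendbuf = malloc(size * sizeof(int));
    recvbuf = malloc(size * sizeof(int));
    for (i = 0; i < size; i++)
        sendbuf[i] = rank;

    MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD);
    printf("rank %d of %d: alltoall completed\n", rank, size);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}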
Thanks,
Kevin

Here's a log of a failed run:

> mpirun -d --debug-daemons --mca btl tcp,self --mca orte_base_help_aggregate 0 --mca btl_base_verbose 100 -np 2 -machinefile foo.hosts /homes/kevin/alltoall.mpi-1.6.5

[compute-g18-5.deepthought.umd.edu:20574] procdir: /tmp/openmpi-sessions-ke...@compute-g18-5.deepthought.umd.edu_0/63142/0/0
[compute-g18-5.deepthought.umd.edu:20574] jobdir: /tmp/openmpi-sessions-ke...@compute-g18-5.deepthought.umd.edu_0/63142/0
[compute-g18-5.deepthought.umd.edu:20574] top: openmpi-sessions-ke...@compute-g18-5.deepthought.umd.edu_0
[compute-g18-5.deepthought.umd.edu:20574] tmp: /tmp
[compute-g18-5.deepthought.umd.edu:20574] mpirun: reset PATH: /cell_root/software/openmpi/1.6.5/gnu/4.8.1/threaded/sys/bin:/cell_root/software/openmpi/1.6.5/gnu/4.8.1/threaded/sys/bin:/cell_root/software/gcc/4.8.1/sys/bin:/cell_root/software/moab/bin:/cell_root/software/gold/bin:/usr/local/ofed/1.5.4/sbin:/usr/local/ofed/1.5.4/bin:/homes/kevin/bin:/homes/kevin/bin/amd64:/dept/oit/glue/scripts:/usr/local/scripts:/usr/local/bin:/usr/bin:/bin:/sbin:/usr/sbin:/usr/afsws/bin:/usr/afsws/etc
[compute-g18-5.deepthought.umd.edu:20574] mpirun: reset LD_LIBRARY_PATH: /cell_root/software/openmpi/1.6.5/gnu/4.8.1/threaded/sys/lib:/usr/local/ofed/1.5.4/lib64
Daemon was launched on compute-g17-33.deepthought.umd.edu - beginning to initialize
[compute-g17-33.deepthought.umd.edu:20174] procdir: /tmp/openmpi-sessions-ke...@compute-g17-33.deepthought.umd.edu_0/63142/0/1
[compute-g17-33.deepthought.umd.edu:20174] jobdir: /tmp/openmpi-sessions-ke...@compute-g17-33.deepthought.umd.edu_0/63142/0
[compute-g17-33.deepthought.umd.edu:20174] top: openmpi-sessions-ke...@compute-g17-33.deepthought.umd.edu_0
[compute-g17-33.deepthought.umd.edu:20174] tmp: /tmp
Daemon [[63142,0],1] checking in as pid 20174 on host compute-g17-33.deepthought.umd.edu
[compute-g17-33.deepthought.umd.edu:20174] [[63142,0],1] orted: up and running - waiting for commands!
[compute-g18-5.deepthought.umd.edu:20574] [[63142,0],0] orted_cmd: received add_local_procs
[compute-g17-33.deepthought.umd.edu:20174] [[63142,0],1] node[0].name compute-g18-5 daemon 0
[compute-g17-33.deepthought.umd.edu:20174] [[63142,0],1] node[1].name compute-g17-33 daemon 1
[compute-g17-33.deepthought.umd.edu:20174] [[63142,0],1] orted_cmd: received add_local_procs
MPIR_being_debugged = 0
MPIR_debug_state = 1
MPIR_partial_attach_ok = 1
MPIR_i_am_starter = 0
MPIR_forward_output = 0
MPIR_proctable_size = 2
MPIR_proctable:
  (i, host, exe, pid) = (0, compute-g18-5.deepthought.umd.edu, /homes/kevin/alltoall.mpi-1.6.5, 20576)
  (i, host, exe, pid) = (1, compute-g17-33, /homes/kevin/alltoall.mpi-1.6.5, 20175)
MPIR_executable_path: NULL
MPIR_server_arguments: NULL
[compute-g18-5.deepthought.umd.edu:20576] procdir: /tmp/openmpi-sessions-ke...@compute-g18-5.deepthought.umd.edu_0/63142/1/0
[compute-g18-5.deepthought.umd.edu:20576] jobdir: /tmp/openmpi-sessions-ke...@compute-g18-5.deepthought.umd.edu_0/63142/1
[compute-g18-5.deepthought.umd.edu:20576] top: openmpi-sessions-ke...@compute-g18-5.deepthought.umd.edu_0
[compute-g18-5.deepthought.umd.edu:20576] tmp: /tmp
[compute-g18-5.deepthought.umd.edu:20574] [[63142,0],0] orted_recv: received sync+nidmap from local proc [[63142,1],0]
[compute-g18-5.deepthought.umd.edu:20576] [[63142,1],0] node[0].name compute-g18-5 daemon 0
[compute-g18-5.deepthought.umd.edu:20576] [[63142,1],0] node[1].name compute-g17-33 daemon 1
[compute-g17-33.deepthought.umd.edu:20175] procdir: /tmp/openmpi-sessions-ke...@compute-g17-33.deepthought.umd.edu_0/63142/1/1
[compute-g17-33.deepthought.umd.edu:20175] jobdir: /tmp/openmpi-sessions-ke...@compute-g17-33.deepthought.umd.edu_0/63142/1
[compute-g17-33.deepthought.umd.edu:20175] top: openmpi-sessions-ke...@compute-g17-33.deepthought.umd.edu_0
[compute-g17-33.deepthought.umd.edu:20175] tmp: /tmp
[compute-g17-33.deepthought.umd.edu:20174] [[63142,0],1] orted_recv: received sync+nidmap from local proc [[63142,1],1]
[compute-g17-33.deepthought.umd.edu:20175] [[63142,1],1] node[0].name compute-g18-5 daemon 0
[compute-g17-33.deepthought.umd.edu:20175] [[63142,1],1] node[1].name compute-g17-33 daemon 1
[compute-g18-5.deepthought.umd.e:20576] mca: base: components_open: Looking for btl components
[compute-g18-5.deepthought.umd.e:20576] mca: base: components_open: opening btl components
[compute-g18-5.deepthought.umd.e:20576] mca: base: components_open: found loaded component self
[compute-g18-5.deepthought.umd.e:20576] mca: base: components_open: component self has no register function
[compute-g18-5.deepthought.umd.e:20576] mca: base: components_open: component self open function successful
[compute-g18-5.deepthought.umd.e:20576] mca: base: components_open: found loaded component tcp
[compute-g18-5.deepthought.umd.e:20576] mca: base: components_open: component tcp register function successful
[compute-g18-5.deepthought.umd.e:20576] mca: base: components_open: component tcp open function successful
[compute-g17-33.deepthought.umd.:20175] mca: base: components_open: Looking for btl components
[compute-g17-33.deepthought.umd.:20175] mca: base: components_open: opening btl components
[compute-g17-33.deepthought.umd.:20175] mca: base: components_open: found loaded component self
[compute-g17-33.deepthought.umd.:20175] mca: base: components_open: component self has no register function
[compute-g17-33.deepthought.umd.:20175] mca: base: components_open: component self open function successful
[compute-g17-33.deepthought.umd.:20175] mca: base: components_open: found loaded component tcp
[compute-g17-33.deepthought.umd.:20175] mca: base: components_open: component tcp register function successful
[compute-g17-33.deepthought.umd.:20175] mca: base: components_open: component tcp open function successful
[compute-g17-33.deepthought.umd.:20175] select: initializing btl component self
[compute-g17-33.deepthought.umd.:20175] select: init of component self returned success
[compute-g17-33.deepthought.umd.:20175] select: initializing btl component tcp
[compute-g17-33.deepthought.umd.:20175] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[compute-g17-33.deepthought.umd.:20175] btl: tcp: Found match: 127.0.0.1 (lo)
[compute-g17-33.deepthought.umd.:20175] select: init of component tcp returned success
[compute-g18-5.deepthought.umd.e:20576] mca: base: close: component self closed
[compute-g18-5.deepthought.umd.e:20576] mca: base: close: unloading component self
[compute-g18-5.deepthought.umd.e:20576] mca: base: close: component tcp closed
[compute-g18-5.deepthought.umd.e:20576] mca: base: close: unloading component tcp
[compute-g18-5.deepthought.umd.edu:20574] [[63142,0],0] orted_cmd: received message_local_procs
[compute-g17-33.deepthought.umd.edu:20174] [[63142,0],1] orted_cmd: received message_local_procs
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  PML add procs failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
--------------------------------------------------------------------------
An MPI process is aborting at a time when it cannot guarantee that all
of its peer processes in the job will be killed properly. You should
double check that everything has shut down cleanly.

  Reason:     Before MPI_INIT completed
  Local host: compute-g18-5.deepthought.umd.edu
  PID:        20576
--------------------------------------------------------------------------
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications. This means that no Open MPI device has indicated
that it can be used to communicate between these processes. This is
an error; Open MPI requires that all MPI processes be able to reach
each other. This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[63142,1],1]) is on host: compute-g17-33.deepthought.umd.edu
  Process 2 ([[63142,1],0]) is on host: compute-g18-5
  BTLs attempted: self tcp

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
--------------------------------------------------------------------------
MPI_INIT has failed because at least one MPI process is unreachable
from another. This *usually* means that an underlying communication
plugin -- such as a BTL or an MTL -- has either not loaded or not
allowed itself to be used. Your MPI job will now abort.

You may wish to try to narrow down the problem;

 * Check the output of ompi_info to see which BTL/MTL plugins are
   available.
 * Run your application with MPI_THREAD_SINGLE.
 * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
   if using MTL-based communications) to see exactly which
   communication plugins were considered and/or discarded.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
An MPI process is aborting at a time when it cannot guarantee that all
of its peer processes in the job will be killed properly. You should
double check that everything has shut down cleanly.

  Reason:     Before MPI_INIT completed
  Local host: compute-g17-33.deepthought.umd.edu
  PID:        20175
--------------------------------------------------------------------------
[compute-g17-33.deepthought.umd.edu:20174] [[63142,0],1] orted_cmd: received waitpid_fired cmd
[compute-g17-33.deepthought.umd.edu:20174] [[63142,0],1] orted_cmd: received iof_complete cmd
[compute-g17-33.deepthought.umd.edu:20174] sess_dir_finalize: proc session dir not empty - leaving
[compute-g18-5.deepthought.umd.edu:20574] sess_dir_finalize: proc session dir not empty - leaving
[compute-g18-5.deepthought.umd.edu:20574] [[63142,0],0] orted_cmd: received iof_complete cmd
--------------------------------------------------------------------------
mpirun has exited due to process rank 1 with PID 20175 on
node compute-g17-33 exiting improperly. There are two reasons this
could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[compute-g18-5.deepthought.umd.edu:20574] [[63142,0],0] orted_cmd: received exit cmd
[compute-g17-33.deepthought.umd.edu:20174] [[63142,0],1] orted_cmd: received exit cmd
[compute-g17-33.deepthought.umd.edu:20174] [[63142,0],1] orted: finalizing
[compute-g18-5.deepthought.umd.edu:20574] sess_dir_finalize: job session dir not empty - leaving
[compute-g17-33.deepthought.umd.edu:20174] sess_dir_finalize: job session dir not empty - leaving
[compute-g18-5.deepthought.umd.edu:20574] [[63142,0],0] Releasing job data for [63142,0]
[compute-g18-5.deepthought.umd.edu:20574] [[63142,0],0] Releasing job data for [63142,1]
[compute-g18-5.deepthought.umd.edu:20574] sess_dir_finalize: proc session dir not empty - leaving
orterun: exiting with status 1
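Based on the hints in the help text above, these are the next things I plan to try ("eth0" here is just a placeholder; I'll substitute whatever interface names our nodes actually use). The btl_tcp_if_include idea comes from the verbose output: the 127.0.0.1/8 exclude check only ever shows up for the process on compute-g17-33, so I'm wondering whether the TCP BTL is picking different (and mutually unroutable) interfaces on the two node types.

# confirm which BTL plugins each node type's install actually has
ompi_info | grep -i btl

# pin the TCP BTL to one known-good interface on all nodes
mpirun --mca btl tcp,self --mca btl_tcp_if_include eth0 \
    -np 2 -machinefile foo.hosts /homes/kevin/alltoall.mpi-1.6.5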