Ralph,

Here you go:

(1080) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug/bin/mpirun --leave-session-attached --debug-daemons --mca oob_base_verbose 10 -np 8 ./helloWorld.182-debug.x
[borg01x142:29232] mca: base: components_register: registering oob
components
[borg01x142:29232] mca: base: components_register: found loaded component
tcp
[borg01x142:29232] mca: base: components_register: component tcp register
function successful
[borg01x142:29232] mca: base: components_open: opening oob components
[borg01x142:29232] mca: base: components_open: found loaded component tcp
[borg01x142:29232] mca: base: components_open: component tcp open function
successful
[borg01x142:29232] mca:oob:select: checking available component tcp
[borg01x142:29232] mca:oob:select: Querying component [tcp]
[borg01x142:29232] oob:tcp: component_available called
[borg01x142:29232] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[borg01x142:29232] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4
[borg01x142:29232] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4
[borg01x142:29232] [[52298,0],0] oob:tcp:init adding 10.1.25.142 to our
list of V4 connections
[borg01x142:29232] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4
[borg01x142:29232] [[52298,0],0] oob:tcp:init adding 172.31.1.254 to our
list of V4 connections
[borg01x142:29232] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4
[borg01x142:29232] [[52298,0],0] oob:tcp:init adding 10.12.25.142 to our
list of V4 connections
[borg01x142:29232] [[52298,0],0] TCP STARTUP
[borg01x142:29232] [[52298,0],0] attempting to bind to IPv4 port 0
[borg01x142:29232] [[52298,0],0] assigned IPv4 port 41686
[borg01x142:29232] mca:oob:select: Adding component to end
[borg01x142:29232] mca:oob:select: Found 1 active transports
srun.slurm: cluster configuration lacks support for cpu binding
srun.slurm: cluster configuration lacks support for cpu binding
[borg01x153:01290] mca: base: components_register: registering oob
components
[borg01x153:01290] mca: base: components_register: found loaded component
tcp
[borg01x143:13793] mca: base: components_register: registering oob
components
[borg01x143:13793] mca: base: components_register: found loaded component
tcp
[borg01x153:01290] mca: base: components_register: component tcp register
function successful
[borg01x153:01290] mca: base: components_open: opening oob components
[borg01x153:01290] mca: base: components_open: found loaded component tcp
[borg01x153:01290] mca: base: components_open: component tcp open function
successful
[borg01x153:01290] mca:oob:select: checking available component tcp
[borg01x153:01290] mca:oob:select: Querying component [tcp]
[borg01x153:01290] oob:tcp: component_available called
[borg01x153:01290] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[borg01x153:01290] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4
[borg01x153:01290] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4
[borg01x153:01290] [[52298,0],4] oob:tcp:init adding 10.1.25.153 to our
list of V4 connections
[borg01x153:01290] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4
[borg01x153:01290] [[52298,0],4] oob:tcp:init adding 172.31.1.254 to our
list of V4 connections
[borg01x153:01290] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4
[borg01x153:01290] [[52298,0],4] oob:tcp:init adding 10.12.25.153 to our
list of V4 connections
[borg01x153:01290] [[52298,0],4] TCP STARTUP
[borg01x153:01290] [[52298,0],4] attempting to bind to IPv4 port 0
[borg01x143:13793] mca: base: components_register: component tcp register
function successful
[borg01x153:01290] [[52298,0],4] assigned IPv4 port 38028
[borg01x143:13793] mca: base: components_open: opening oob components
[borg01x143:13793] mca: base: components_open: found loaded component tcp
[borg01x143:13793] mca: base: components_open: component tcp open function
successful
[borg01x143:13793] mca:oob:select: checking available component tcp
[borg01x143:13793] mca:oob:select: Querying component [tcp]
[borg01x143:13793] oob:tcp: component_available called
[borg01x143:13793] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[borg01x143:13793] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4
[borg01x143:13793] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4
[borg01x143:13793] [[52298,0],1] oob:tcp:init adding 10.1.25.143 to our
list of V4 connections
[borg01x143:13793] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4
[borg01x143:13793] [[52298,0],1] oob:tcp:init adding 172.31.1.254 to our
list of V4 connections
[borg01x143:13793] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4
[borg01x143:13793] [[52298,0],1] oob:tcp:init adding 10.12.25.143 to our
list of V4 connections
[borg01x143:13793] [[52298,0],1] TCP STARTUP
[borg01x143:13793] [[52298,0],1] attempting to bind to IPv4 port 0
[borg01x153:01290] mca:oob:select: Adding component to end
[borg01x153:01290] mca:oob:select: Found 1 active transports
[borg01x143:13793] [[52298,0],1] assigned IPv4 port 44719
[borg01x143:13793] mca:oob:select: Adding component to end
[borg01x143:13793] mca:oob:select: Found 1 active transports
[borg01x144:30878] mca: base: components_register: registering oob
components
[borg01x144:30878] mca: base: components_register: found loaded component
tcp
[borg01x144:30878] mca: base: components_register: component tcp register
function successful
[borg01x144:30878] mca: base: components_open: opening oob components
[borg01x144:30878] mca: base: components_open: found loaded component tcp
[borg01x144:30878] mca: base: components_open: component tcp open function
successful
[borg01x144:30878] mca:oob:select: checking available component tcp
[borg01x144:30878] mca:oob:select: Querying component [tcp]
[borg01x144:30878] oob:tcp: component_available called
[borg01x144:30878] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[borg01x144:30878] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4
[borg01x144:30878] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4
[borg01x144:30878] [[52298,0],2] oob:tcp:init adding 10.1.25.144 to our
list of V4 connections
[borg01x144:30878] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4
[borg01x144:30878] [[52298,0],2] oob:tcp:init adding 172.31.1.254 to our
list of V4 connections
[borg01x144:30878] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4
[borg01x144:30878] [[52298,0],2] oob:tcp:init adding 10.12.25.144 to our
list of V4 connections
[borg01x144:30878] [[52298,0],2] TCP STARTUP
[borg01x144:30878] [[52298,0],2] attempting to bind to IPv4 port 0
[borg01x144:30878] [[52298,0],2] assigned IPv4 port 40700
[borg01x144:30878] mca:oob:select: Adding component to end
[borg01x144:30878] mca:oob:select: Found 1 active transports
[borg01x154:01154] mca: base: components_register: registering oob
components
[borg01x154:01154] mca: base: components_register: found loaded component
tcp
[borg01x154:01154] mca: base: components_register: component tcp register
function successful
[borg01x154:01154] mca: base: components_open: opening oob components
[borg01x154:01154] mca: base: components_open: found loaded component tcp
[borg01x154:01154] mca: base: components_open: component tcp open function
successful
[borg01x154:01154] mca:oob:select: checking available component tcp
[borg01x154:01154] mca:oob:select: Querying component [tcp]
[borg01x154:01154] oob:tcp: component_available called
[borg01x154:01154] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[borg01x154:01154] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4
[borg01x154:01154] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4
[borg01x154:01154] [[52298,0],5] oob:tcp:init adding 10.1.25.154 to our
list of V4 connections
[borg01x154:01154] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4
[borg01x154:01154] [[52298,0],5] oob:tcp:init adding 172.31.1.254 to our
list of V4 connections
[borg01x154:01154] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4
[borg01x154:01154] [[52298,0],5] oob:tcp:init adding 10.12.25.154 to our
list of V4 connections
[borg01x154:01154] [[52298,0],5] TCP STARTUP
[borg01x154:01154] [[52298,0],5] attempting to bind to IPv4 port 0
[borg01x154:01154] [[52298,0],5] assigned IPv4 port 41191
[borg01x154:01154] mca:oob:select: Adding component to end
[borg01x154:01154] mca:oob:select: Found 1 active transports
[borg01x145:02419] mca: base: components_register: registering oob
components
[borg01x145:02419] mca: base: components_register: found loaded component
tcp
[borg01x145:02419] mca: base: components_register: component tcp register
function successful
[borg01x145:02419] mca: base: components_open: opening oob components
[borg01x145:02419] mca: base: components_open: found loaded component tcp
[borg01x145:02419] mca: base: components_open: component tcp open function
successful
[borg01x145:02419] mca:oob:select: checking available component tcp
[borg01x145:02419] mca:oob:select: Querying component [tcp]
[borg01x145:02419] oob:tcp: component_available called
[borg01x145:02419] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[borg01x145:02419] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4
[borg01x145:02419] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4
[borg01x145:02419] [[52298,0],3] oob:tcp:init adding 10.1.25.145 to our
list of V4 connections
[borg01x145:02419] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4
[borg01x145:02419] [[52298,0],3] oob:tcp:init adding 172.31.1.254 to our
list of V4 connections
[borg01x145:02419] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4
[borg01x145:02419] [[52298,0],3] oob:tcp:init adding 10.12.25.145 to our
list of V4 connections
[borg01x145:02419] [[52298,0],3] TCP STARTUP
[borg01x145:02419] [[52298,0],3] attempting to bind to IPv4 port 0
[borg01x145:02419] [[52298,0],3] assigned IPv4 port 51079
[borg01x145:02419] mca:oob:select: Adding component to end
[borg01x145:02419] mca:oob:select: Found 1 active transports
[borg01x144:30878] [[52298,0],2] ORTE_ERROR_LOG: Bad parameter in file
base/rml_base_contact.c at line 161
[borg01x144:30878] [[52298,0],2] ORTE_ERROR_LOG: Bad parameter in file
routed_binomial.c at line 498
[borg01x144:30878] [[52298,0],2] ORTE_ERROR_LOG: Bad parameter in file
base/ess_base_std_orted.c at line 539
srun.slurm: error: borg01x143: task 0: Exited with exit code 213
srun.slurm: Terminating job step 2332583.24
slurmd[borg01x144]: *** STEP 2332583.24 KILLED AT 2014-08-29T13:59:30 WITH
SIGNAL 9 ***
srun.slurm: Job step aborted: Waiting up to 2 seconds for job step to
finish.
srun.slurm: error: borg01x153: task 3: Exited with exit code 213
[borg01x153:01290] [[52298,0],4] ORTE_ERROR_LOG: Bad parameter in file
base/rml_base_contact.c at line 161
[borg01x153:01290] [[52298,0],4] ORTE_ERROR_LOG: Bad parameter in file
routed_binomial.c at line 498
[borg01x153:01290] [[52298,0],4] ORTE_ERROR_LOG: Bad parameter in file
base/ess_base_std_orted.c at line 539
[borg01x143:13793] [[52298,0],1] ORTE_ERROR_LOG: Bad parameter in file
base/rml_base_contact.c at line 161
[borg01x143:13793] [[52298,0],1] ORTE_ERROR_LOG: Bad parameter in file
routed_binomial.c at line 498
[borg01x143:13793] [[52298,0],1] ORTE_ERROR_LOG: Bad parameter in file
base/ess_base_std_orted.c at line 539
slurmd[borg01x144]: *** STEP 2332583.24 KILLED AT 2014-08-29T13:59:30 WITH
SIGNAL 9 ***
srun.slurm: error: borg01x144: task 1: Exited with exit code 213
[borg01x154:01154] [[52298,0],5] ORTE_ERROR_LOG: Bad parameter in file
base/rml_base_contact.c at line 161
[borg01x154:01154] [[52298,0],5] ORTE_ERROR_LOG: Bad parameter in file
routed_binomial.c at line 498
[borg01x154:01154] [[52298,0],5] ORTE_ERROR_LOG: Bad parameter in file
base/ess_base_std_orted.c at line 539
slurmd[borg01x154]: *** STEP 2332583.24 KILLED AT 2014-08-29T13:59:30 WITH
SIGNAL 9 ***
slurmd[borg01x154]: *** STEP 2332583.24 KILLED AT 2014-08-29T13:59:30 WITH
SIGNAL 9 ***
srun.slurm: error: borg01x154: task 4: Exited with exit code 213
srun.slurm: error: borg01x145: task 2: Exited with exit code 213
[borg01x145:02419] [[52298,0],3] ORTE_ERROR_LOG: Bad parameter in file
base/rml_base_contact.c at line 161
[borg01x145:02419] [[52298,0],3] ORTE_ERROR_LOG: Bad parameter in file
routed_binomial.c at line 498
[borg01x145:02419] [[52298,0],3] ORTE_ERROR_LOG: Bad parameter in file
base/ess_base_std_orted.c at line 539
slurmd[borg01x145]: *** STEP 2332583.24 KILLED AT 2014-08-29T13:59:30 WITH
SIGNAL 9 ***
slurmd[borg01x145]: *** STEP 2332583.24 KILLED AT 2014-08-29T13:59:30 WITH
SIGNAL 9 ***
sh: tcp://10.1.25.142,172.31.1.254,10.12.25.142:41686: No such file or
directory
[borg01x142:29232] [[52298,0],0] TCP SHUTDOWN
[borg01x142:29232] mca: base: close: component tcp closed
[borg01x142:29232] mca: base: close: unloading component tcp

Note: if I can get the allocation today, I want to try all of this on a
single Sandy Bridge node rather than on six. That should make comparing the
various runs a bit easier!
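
For comparing runs, one option is to save each run's output to a file and
tally the ORTE_ERROR_LOG entries by source location. This is only a minimal
sketch: the function name summarize_orte_errors and the file name run.log
are hypothetical, and it assumes that an entry split by mail wrapping
continues on the very next line, as in the paste above.

```shell
#!/bin/sh
# Tally ORTE_ERROR_LOG entries by source location, reading mpirun output
# from stdin. Assumption: a wrapped entry continues on the next line.
summarize_orte_errors() {
  awk '
    /ORTE_ERROR_LOG:/ {
      entry = $0
      # Re-join an entry whose "at line N" part wrapped onto the next line.
      if (entry !~ / at line [0-9]+/ && (getline cont) > 0)
        entry = entry " " cont
      # Keep only "<file> at line <N>".
      sub(/^.* in file /, "", entry)
      count[entry]++
    }
    END {
      for (loc in count) printf "%d %s\n", count[loc], loc
    }
  ' | sort -k2
}
```

With the output of a run saved in run.log, `summarize_orte_errors < run.log`
prints one line per source location, prefixed by how many times it was
reported, so two runs can be compared with a plain diff.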

Matt



On Fri, Aug 29, 2014 at 12:42 PM, Ralph Castain <r...@open-mpi.org> wrote:

> Okay, something quite weird is happening here. I can't replicate using the
> 1.8.2 release tarball on a slurm machine, so my guess is that something
> else is going on here.
>
> Could you please rebuild the 1.8.2 code with --enable-debug on the
> configure line (assuming you haven't already done so), and then rerun that
> version as before but adding "--mca oob_base_verbose 10" to the cmd line?
>
>
> On Aug 29, 2014, at 4:22 AM, Matt Thompson <fort...@gmail.com> wrote:
>
> Ralph,
>
> For 1.8.2rc4 I get:
>
> (1003) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2rc4/bin/mpirun --leave-session-attached --debug-daemons -np 8 ./helloWorld.182.x
> srun.slurm: cluster configuration lacks support for cpu binding
> srun.slurm: cluster configuration lacks support for cpu binding
> Daemon [[47143,0],5] checking in as pid 10990 on host borg01x154
> [borg01x154:10990] [[47143,0],5] orted: up and running - waiting for
> commands!
> Daemon [[47143,0],1] checking in as pid 23473 on host borg01x143
> Daemon [[47143,0],2] checking in as pid 8250 on host borg01x144
> [borg01x144:08250] [[47143,0],2] orted: up and running - waiting for
> commands!
> [borg01x143:23473] [[47143,0],1] orted: up and running - waiting for
> commands!
> Daemon [[47143,0],3] checking in as pid 12320 on host borg01x145
> Daemon [[47143,0],4] checking in as pid 10902 on host borg01x153
> [borg01x153:10902] [[47143,0],4] orted: up and running - waiting for
> commands!
> [borg01x145:12320] [[47143,0],3] orted: up and running - waiting for
> commands!
> [borg01x142:01629] [[47143,0],0] orted_cmd: received add_local_procs
> [borg01x144:08250] [[47143,0],2] orted_cmd: received add_local_procs
> [borg01x153:10902] [[47143,0],4] orted_cmd: received add_local_procs
> [borg01x143:23473] [[47143,0],1] orted_cmd: received add_local_procs
> [borg01x145:12320] [[47143,0],3] orted_cmd: received add_local_procs
> [borg01x154:10990] [[47143,0],5] orted_cmd: received add_local_procs
> [borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap from
> local proc [[47143,1],0]
> [borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap from
> local proc [[47143,1],2]
> [borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap from
> local proc [[47143,1],3]
> [borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap from
> local proc [[47143,1],1]
> [borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap from
> local proc [[47143,1],5]
> [borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap from
> local proc [[47143,1],4]
> [borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap from
> local proc [[47143,1],6]
> [borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap from
> local proc [[47143,1],7]
>   MPIR_being_debugged = 0
>   MPIR_debug_state = 1
>   MPIR_partial_attach_ok = 1
>   MPIR_i_am_starter = 0
>   MPIR_forward_output = 0
>   MPIR_proctable_size = 8
>   MPIR_proctable:
>     (i, host, exe, pid) = (0, borg01x142,
> /home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1647)
>     (i, host, exe, pid) = (1, borg01x142,
> /home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1648)
>     (i, host, exe, pid) = (2, borg01x142,
> /home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1650)
>     (i, host, exe, pid) = (3, borg01x142,
> /home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1652)
>     (i, host, exe, pid) = (4, borg01x142,
> /home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1654)
>     (i, host, exe, pid) = (5, borg01x142,
> /home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1656)
>     (i, host, exe, pid) = (6, borg01x142,
> /home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1658)
>     (i, host, exe, pid) = (7, borg01x142,
> /home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1660)
> MPIR_executable_path: NULL
> MPIR_server_arguments: NULL
> [borg01x142:01629] [[47143,0],0] orted_cmd: received message_local_procs
> [borg01x144:08250] [[47143,0],2] orted_cmd: received message_local_procs
> [borg01x143:23473] [[47143,0],1] orted_cmd: received message_local_procs
> [borg01x153:10902] [[47143,0],4] orted_cmd: received message_local_procs
> [borg01x154:10990] [[47143,0],5] orted_cmd: received message_local_procs
> [borg01x145:12320] [[47143,0],3] orted_cmd: received message_local_procs
> [borg01x142:01629] [[47143,0],0] orted_cmd: received message_local_procs
> [borg01x143:23473] [[47143,0],1] orted_cmd: received message_local_procs
> [borg01x144:08250] [[47143,0],2] orted_cmd: received message_local_procs
> [borg01x153:10902] [[47143,0],4] orted_cmd: received message_local_procs
> [borg01x145:12320] [[47143,0],3] orted_cmd: received message_local_procs
> Process    2 of    8 is on borg01x142
> Process    5 of    8 is on borg01x142
> Process    4 of    8 is on borg01x142
> Process    1 of    8 is on borg01x142
> Process    0 of    8 is on borg01x142
> Process    3 of    8 is on borg01x142
> Process    6 of    8 is on borg01x142
> Process    7 of    8 is on borg01x142
> [borg01x154:10990] [[47143,0],5] orted_cmd: received message_local_procs
> [borg01x142:01629] [[47143,0],0] orted_cmd: received message_local_procs
> [borg01x144:08250] [[47143,0],2] orted_cmd: received message_local_procs
> [borg01x143:23473] [[47143,0],1] orted_cmd: received message_local_procs
> [borg01x153:10902] [[47143,0],4] orted_cmd: received message_local_procs
> [borg01x154:10990] [[47143,0],5] orted_cmd: received message_local_procs
> [borg01x145:12320] [[47143,0],3] orted_cmd: received message_local_procs
> [borg01x142:01629] [[47143,0],0] orted_recv: received sync from local proc
> [[47143,1],2]
> [borg01x142:01629] [[47143,0],0] orted_recv: received sync from local proc
> [[47143,1],1]
> [borg01x142:01629] [[47143,0],0] orted_recv: received sync from local proc
> [[47143,1],3]
> [borg01x142:01629] [[47143,0],0] orted_recv: received sync from local proc
> [[47143,1],0]
> [borg01x142:01629] [[47143,0],0] orted_recv: received sync from local proc
> [[47143,1],4]
> [borg01x142:01629] [[47143,0],0] orted_recv: received sync from local proc
> [[47143,1],6]
> [borg01x142:01629] [[47143,0],0] orted_recv: received sync from local proc
> [[47143,1],5]
> [borg01x142:01629] [[47143,0],0] orted_recv: received sync from local proc
> [[47143,1],7]
> [borg01x142:01629] [[47143,0],0] orted_cmd: received exit cmd
> [borg01x144:08250] [[47143,0],2] orted_cmd: received exit cmd
> [borg01x144:08250] [[47143,0],2] orted_cmd: all routes and children gone -
> exiting
> [borg01x153:10902] [[47143,0],4] orted_cmd: received exit cmd
> [borg01x153:10902] [[47143,0],4] orted_cmd: all routes and children gone -
> exiting
> [borg01x143:23473] [[47143,0],1] orted_cmd: received exit cmd
> [borg01x154:10990] [[47143,0],5] orted_cmd: received exit cmd
> [borg01x154:10990] [[47143,0],5] orted_cmd: all routes and children gone -
> exiting
> [borg01x145:12320] [[47143,0],3] orted_cmd: received exit cmd
> [borg01x145:12320] [[47143,0],3] orted_cmd: all routes and children gone -
> exiting
>
> Using the 1.8.2 mpirun:
>
> (1004) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2/bin/mpirun --leave-session-attached --debug-daemons -np 8 ./helloWorld.182.x
> srun.slurm: cluster configuration lacks support for cpu binding
> srun.slurm: cluster configuration lacks support for cpu binding
> [borg01x143:23494] [[47330,0],1] ORTE_ERROR_LOG: Bad parameter in file
> base/rml_base_contact.c at line 161
> [borg01x143:23494] [[47330,0],1] ORTE_ERROR_LOG: Bad parameter in file
> routed_binomial.c at line 498
> [borg01x143:23494] [[47330,0],1] ORTE_ERROR_LOG: Bad parameter in file
> base/ess_base_std_orted.c at line 539
> srun.slurm: error: borg01x143: task 0: Exited with exit code 213
> srun.slurm: Terminating job step 2332583.4
> [borg01x153:10915] [[47330,0],4] ORTE_ERROR_LOG: Bad parameter in file
> base/rml_base_contact.c at line 161
> [borg01x153:10915] [[47330,0],4] ORTE_ERROR_LOG: Bad parameter in file
> routed_binomial.c at line 498
> [borg01x153:10915] [[47330,0],4] ORTE_ERROR_LOG: Bad parameter in file
> base/ess_base_std_orted.c at line 539
> [borg01x144:08263] [[47330,0],2] ORTE_ERROR_LOG: Bad parameter in file
> base/rml_base_contact.c at line 161
> [borg01x144:08263] [[47330,0],2] ORTE_ERROR_LOG: Bad parameter in file
> routed_binomial.c at line 498
> [borg01x144:08263] [[47330,0],2] ORTE_ERROR_LOG: Bad parameter in file
> base/ess_base_std_orted.c at line 539
> srun.slurm: Job step aborted: Waiting up to 2 seconds for job step to
> finish.
> slurmd[borg01x145]: *** STEP 2332583.4 KILLED AT 2014-08-29T07:16:20 WITH
> SIGNAL 9 ***
> slurmd[borg01x154]: *** STEP 2332583.4 KILLED AT 2014-08-29T07:16:20 WITH
> SIGNAL 9 ***
> slurmd[borg01x153]: *** STEP 2332583.4 KILLED AT 2014-08-29T07:16:20 WITH
> SIGNAL 9 ***
> slurmd[borg01x153]: *** STEP 2332583.4 KILLED AT 2014-08-29T07:16:20 WITH
> SIGNAL 9 ***
> srun.slurm: error: borg01x144: task 1: Exited with exit code 213
> slurmd[borg01x144]: *** STEP 2332583.4 KILLED AT 2014-08-29T07:16:20 WITH
> SIGNAL 9 ***
> slurmd[borg01x144]: *** STEP 2332583.4 KILLED AT 2014-08-29T07:16:20 WITH
> SIGNAL 9 ***
> srun.slurm: error: borg01x153: task 3: Exited with exit code 213
> slurmd[borg01x154]: *** STEP 2332583.4 KILLED AT 2014-08-29T07:16:20 WITH
> SIGNAL 9 ***
> slurmd[borg01x145]: *** STEP 2332583.4 KILLED AT 2014-08-29T07:16:20 WITH
> SIGNAL 9 ***
> srun.slurm: error: borg01x154: task 4: Killed
> srun.slurm: error: borg01x145: task 2: Killed
> sh: tcp://10.1.25.142,172.31.1.254,10.12.25.142:34169: No such file or
> directory
>
>
>
>
> On Thu, Aug 28, 2014 at 7:17 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> I'm unaware of any changes to the Slurm integration between rc4 and final
>> release. It sounds like this might be something else going on - try adding
>> "--leave-session-attached --debug-daemons" to your 1.8.2 command line and
>> let's see if any errors get reported.
>>
>>
>> On Aug 28, 2014, at 12:20 PM, Matt Thompson <fort...@gmail.com> wrote:
>>
>> Open MPI List,
>>
>> I recently encountered an odd bug with Open MPI 1.8.1 and GCC 4.9.1 on
>> our cluster (reported on this list), and decided to try it with 1.8.2.
>> However, we seem to be having an issue with Open MPI 1.8.2 and SLURM. Even
>> weirder, Open MPI 1.8.2rc4 doesn't show the bug. And the bug is: I get no
>> stdout with Open MPI 1.8.2. That is, HelloWorld doesn't work.
>>
>> To wit, our sysadmin has two tarballs:
>>
>> (1441) $ sha1sum openmpi-1.8.2rc4.tar.bz2
>> 7e7496913c949451f546f22a1a159df25f8bb683  openmpi-1.8.2rc4.tar.bz2
>> (1442) $ sha1sum openmpi-1.8.2.tar.gz
>> cf2b1e45575896f63367406c6c50574699d8b2e1  openmpi-1.8.2.tar.gz
>>
>> I then build each with a script, the same way our sysadmin usually does:
>>
>>> #!/bin/sh
>>> set -x
>>> export PREFIX=/discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2
>>> export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/nlocal/slurm/2.6.3/lib64
>>> build() {
>>>   echo `pwd`
>>>   ./configure --with-slurm --disable-wrapper-rpath --enable-shared --enable-mca-no-build=btl-usnic \
>>>       CC=gcc CXX=g++ F77=gfortran FC=gfortran \
>>>       CFLAGS="-mtune=generic -fPIC -m64" CXXFLAGS="-mtune=generic -fPIC -m64" FFLAGS="-mtune=generic -fPIC -m64" \
>>>       F77FLAGS="-mtune=generic -fPIC -m64" FCFLAGS="-mtune=generic -fPIC -m64" F90FLAGS="-mtune=generic -fPIC -m64" \
>>>       LDFLAGS="-L/usr/nlocal/slurm/2.6.3/lib64" CPPFLAGS="-I/usr/nlocal/slurm/2.6.3/include" LIBS="-lpciaccess" \
>>>      --prefix=${PREFIX} 2>&1 | tee configure.1.8.2.log
>>>   make 2>&1 | tee make.1.8.2.log
>>>   make check 2>&1 | tee makecheck.1.8.2.log
>>>   make install 2>&1 | tee makeinstall.1.8.2.log
>>> }
>>> echo "calling build"
>>> build
>>> echo "exiting"
>>
>>
>> The only difference between the two is '1.8.2' or '1.8.2rc4' in the
>> PREFIX and log file tees.  Now, let us test. First, I grab some nodes with
>> slurm:
>>
>>> $ salloc --nodes=6 --ntasks-per-node=16 --constraint=sand --time=09:00:00 --account=g0620 --mail-type=BEGIN
>>
>>
>> Once I get my nodes, I run with 1.8.2rc4:
>>
>>> (1142) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2rc4/bin/mpifort -o helloWorld.182rc4.x helloWorld.F90
>>> (1143) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2rc4/bin/mpirun -np 8 ./helloWorld.182rc4.x
>>> Process    0 of    8 is on borg01w044
>>> Process    5 of    8 is on borg01w044
>>> Process    3 of    8 is on borg01w044
>>> Process    7 of    8 is on borg01w044
>>> Process    1 of    8 is on borg01w044
>>> Process    2 of    8 is on borg01w044
>>> Process    4 of    8 is on borg01w044
>>> Process    6 of    8 is on borg01w044
>>
>>
>> Now 1.8.2:
>>
>>> (1144) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2/bin/mpifort -o helloWorld.182.x helloWorld.F90
>>> (1145) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2/bin/mpirun -np 8 ./helloWorld.182.x
>>> (1146) $
>>
>>
>> No output at all. But, if I take the helloWorld.x from 1.8.2 and run it
>> with 1.8.2rc4's mpirun:
>>
>>> (1146) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2rc4/bin/mpirun -np 8 ./helloWorld.182.x
>>> Process    5 of    8 is on borg01w044
>>> Process    7 of    8 is on borg01w044
>>> Process    2 of    8 is on borg01w044
>>> Process    4 of    8 is on borg01w044
>>> Process    1 of    8 is on borg01w044
>>> Process    3 of    8 is on borg01w044
>>> Process    6 of    8 is on borg01w044
>>> Process    0 of    8 is on borg01w044
>>
>>
>> So... any idea what is happening here? There did seem to be a few
>> SLURM-related changes between the two tarballs involving /dev/null, but
>> deciphering them is a bit beyond me.
>>
>> You can find the ompi_info, build, make, config, etc logs at these links
>> (they are ~300kB which is over the mailing list limit according to the Open
>> MPI web page):
>>
>> https://dl.dropboxusercontent.com/u/61696/OMPI-1.8.2rc4-Output.tar.bz2
>> https://dl.dropboxusercontent.com/u/61696/OMPI-1.8.2-Output.tar.bz2
>>
>> Thank you for any help and please let me know if you need more
>> information,
>> Matt
>>
>> --
>> "And, isn't sanity really just a one-trick pony anyway? I mean all you
>>  get is one trick: rational thinking. But when you're good and crazy,
>>  oooh, oooh, oooh, the sky is the limit!" -- The Tick
>>
>>  _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post:
>> http://www.open-mpi.org/community/lists/users/2014/08/25182.php
>>
>>
>>
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post:
>> http://www.open-mpi.org/community/lists/users/2014/08/25184.php
>>
>
>
>
> --
> "And, isn't sanity really just a one-trick pony anyway? I mean all you
>  get is one trick: rational thinking. But when you're good and crazy,
>  oooh, oooh, oooh, the sky is the limit!" -- The Tick
>
>  _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/08/25187.php
>
>
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/08/25193.php
>



-- 
"And, isn't sanity really just a one-trick pony anyway? I mean all you
 get is one trick: rational thinking. But when you're good and crazy,
 oooh, oooh, oooh, the sky is the limit!" -- The Tick
