Rats - I also need "--mca plm_base_verbose 5" on there so I can see the command line being executed. Can you add it?
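For reference, the full command with both verbose flags added would look roughly like this (same debug build and hello-world binary as in the run quoted below):

    /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug/bin/mpirun \
        --leave-session-attached --debug-daemons \
        --mca oob_base_verbose 10 --mca plm_base_verbose 5 \
        -np 8 ./helloWorld.182-debug.x

With plm_base_verbose turned up, the output should include the srun command line that mpirun constructs to launch the orted daemons, which is the part I want to check.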
On Aug 29, 2014, at 11:16 AM, Matt Thompson <fort...@gmail.com> wrote: > Ralph, > > Here you go: > > (1080) $ > /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug/bin/mpirun > --leave-session-attached --debug-daemons --mca oob_base_verbose 10 -np 8 > ./helloWorld.182-debug.x > [borg01x142:29232] mca: base: components_register: registering oob components > [borg01x142:29232] mca: base: components_register: found loaded component tcp > [borg01x142:29232] mca: base: components_register: component tcp register > function successful > [borg01x142:29232] mca: base: components_open: opening oob components > [borg01x142:29232] mca: base: components_open: found loaded component tcp > [borg01x142:29232] mca: base: components_open: component tcp open function > successful > [borg01x142:29232] mca:oob:select: checking available component tcp > [borg01x142:29232] mca:oob:select: Querying component [tcp] > [borg01x142:29232] oob:tcp: component_available called > [borg01x142:29232] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4 > [borg01x142:29232] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4 > [borg01x142:29232] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4 > [borg01x142:29232] [[52298,0],0] oob:tcp:init adding 10.1.25.142 to our list > of V4 connections > [borg01x142:29232] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4 > [borg01x142:29232] [[52298,0],0] oob:tcp:init adding 172.31.1.254 to our list > of V4 connections > [borg01x142:29232] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4 > [borg01x142:29232] [[52298,0],0] oob:tcp:init adding 10.12.25.142 to our list > of V4 connections > [borg01x142:29232] [[52298,0],0] TCP STARTUP > [borg01x142:29232] [[52298,0],0] attempting to bind to IPv4 port 0 > [borg01x142:29232] [[52298,0],0] assigned IPv4 port 41686 > [borg01x142:29232] mca:oob:select: Adding component to end > [borg01x142:29232] mca:oob:select: Found 1 active transports > srun.slurm: cluster configuration lacks support for cpu binding > srun.slurm: cluster configuration lacks support for cpu binding > [borg01x153:01290] mca: base: components_register: registering oob components > [borg01x153:01290] mca: base: components_register: found loaded component tcp > [borg01x143:13793] mca: base: components_register: registering oob components > [borg01x143:13793] mca: base: components_register: found loaded component tcp > [borg01x153:01290] mca: base: components_register: component tcp register > function successful > [borg01x153:01290] mca: base: components_open: opening oob components > [borg01x153:01290] mca: base: components_open: found loaded component tcp > [borg01x153:01290] mca: base: components_open: component tcp open function > successful > [borg01x153:01290] mca:oob:select: checking available component tcp > [borg01x153:01290] mca:oob:select: Querying component [tcp] > [borg01x153:01290] oob:tcp: component_available called > [borg01x153:01290] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4 > [borg01x153:01290] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4 > [borg01x153:01290] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4 > [borg01x153:01290] [[52298,0],4] oob:tcp:init adding 10.1.25.153 to our list > of V4 connections > [borg01x153:01290] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4 > [borg01x153:01290] [[52298,0],4] oob:tcp:init adding 172.31.1.254 to our list > of V4 connections > [borg01x153:01290] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4 > [borg01x153:01290] [[52298,0],4] oob:tcp:init adding 10.12.25.153 to our list > of V4 connections > [borg01x153:01290] [[52298,0],4] 
TCP STARTUP > [borg01x153:01290] [[52298,0],4] attempting to bind to IPv4 port 0 > [borg01x143:13793] mca: base: components_register: component tcp register > function successful > [borg01x153:01290] [[52298,0],4] assigned IPv4 port 38028 > [borg01x143:13793] mca: base: components_open: opening oob components > [borg01x143:13793] mca: base: components_open: found loaded component tcp > [borg01x143:13793] mca: base: components_open: component tcp open function > successful > [borg01x143:13793] mca:oob:select: checking available component tcp > [borg01x143:13793] mca:oob:select: Querying component [tcp] > [borg01x143:13793] oob:tcp: component_available called > [borg01x143:13793] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4 > [borg01x143:13793] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4 > [borg01x143:13793] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4 > [borg01x143:13793] [[52298,0],1] oob:tcp:init adding 10.1.25.143 to our list > of V4 connections > [borg01x143:13793] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4 > [borg01x143:13793] [[52298,0],1] oob:tcp:init adding 172.31.1.254 to our list > of V4 connections > [borg01x143:13793] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4 > [borg01x143:13793] [[52298,0],1] oob:tcp:init adding 10.12.25.143 to our list > of V4 connections > [borg01x143:13793] [[52298,0],1] TCP STARTUP > [borg01x143:13793] [[52298,0],1] attempting to bind to IPv4 port 0 > [borg01x153:01290] mca:oob:select: Adding component to end > [borg01x153:01290] mca:oob:select: Found 1 active transports > [borg01x143:13793] [[52298,0],1] assigned IPv4 port 44719 > [borg01x143:13793] mca:oob:select: Adding component to end > [borg01x143:13793] mca:oob:select: Found 1 active transports > [borg01x144:30878] mca: base: components_register: registering oob components > [borg01x144:30878] mca: base: components_register: found loaded component tcp > [borg01x144:30878] mca: base: components_register: component tcp register > function successful > [borg01x144:30878] mca: base: components_open: opening oob components > [borg01x144:30878] mca: base: components_open: found loaded component tcp > [borg01x144:30878] mca: base: components_open: component tcp open function > successful > [borg01x144:30878] mca:oob:select: checking available component tcp > [borg01x144:30878] mca:oob:select: Querying component [tcp] > [borg01x144:30878] oob:tcp: component_available called > [borg01x144:30878] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4 > [borg01x144:30878] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4 > [borg01x144:30878] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4 > [borg01x144:30878] [[52298,0],2] oob:tcp:init adding 10.1.25.144 to our list > of V4 connections > [borg01x144:30878] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4 > [borg01x144:30878] [[52298,0],2] oob:tcp:init adding 172.31.1.254 to our list > of V4 connections > [borg01x144:30878] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4 > [borg01x144:30878] [[52298,0],2] oob:tcp:init adding 10.12.25.144 to our list > of V4 connections > [borg01x144:30878] [[52298,0],2] TCP STARTUP > [borg01x144:30878] [[52298,0],2] attempting to bind to IPv4 port 0 > [borg01x144:30878] [[52298,0],2] assigned IPv4 port 40700 > [borg01x144:30878] mca:oob:select: Adding component to end > [borg01x144:30878] mca:oob:select: Found 1 active transports > [borg01x154:01154] mca: base: components_register: registering oob components > [borg01x154:01154] mca: base: components_register: found loaded component tcp > [borg01x154:01154] mca: base: 
components_register: component tcp register > function successful > [borg01x154:01154] mca: base: components_open: opening oob components > [borg01x154:01154] mca: base: components_open: found loaded component tcp > [borg01x154:01154] mca: base: components_open: component tcp open function > successful > [borg01x154:01154] mca:oob:select: checking available component tcp > [borg01x154:01154] mca:oob:select: Querying component [tcp] > [borg01x154:01154] oob:tcp: component_available called > [borg01x154:01154] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4 > [borg01x154:01154] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4 > [borg01x154:01154] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4 > [borg01x154:01154] [[52298,0],5] oob:tcp:init adding 10.1.25.154 to our list > of V4 connections > [borg01x154:01154] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4 > [borg01x154:01154] [[52298,0],5] oob:tcp:init adding 172.31.1.254 to our list > of V4 connections > [borg01x154:01154] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4 > [borg01x154:01154] [[52298,0],5] oob:tcp:init adding 10.12.25.154 to our list > of V4 connections > [borg01x154:01154] [[52298,0],5] TCP STARTUP > [borg01x154:01154] [[52298,0],5] attempting to bind to IPv4 port 0 > [borg01x154:01154] [[52298,0],5] assigned IPv4 port 41191 > [borg01x154:01154] mca:oob:select: Adding component to end > [borg01x154:01154] mca:oob:select: Found 1 active transports > [borg01x145:02419] mca: base: components_register: registering oob components > [borg01x145:02419] mca: base: components_register: found loaded component tcp > [borg01x145:02419] mca: base: components_register: component tcp register > function successful > [borg01x145:02419] mca: base: components_open: opening oob components > [borg01x145:02419] mca: base: components_open: found loaded component tcp > [borg01x145:02419] mca: base: components_open: component tcp open function > successful > [borg01x145:02419] mca:oob:select: checking available component tcp > [borg01x145:02419] mca:oob:select: Querying component [tcp] > [borg01x145:02419] oob:tcp: component_available called > [borg01x145:02419] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4 > [borg01x145:02419] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4 > [borg01x145:02419] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4 > [borg01x145:02419] [[52298,0],3] oob:tcp:init adding 10.1.25.145 to our list > of V4 connections > [borg01x145:02419] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4 > [borg01x145:02419] [[52298,0],3] oob:tcp:init adding 172.31.1.254 to our list > of V4 connections > [borg01x145:02419] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4 > [borg01x145:02419] [[52298,0],3] oob:tcp:init adding 10.12.25.145 to our list > of V4 connections > [borg01x145:02419] [[52298,0],3] TCP STARTUP > [borg01x145:02419] [[52298,0],3] attempting to bind to IPv4 port 0 > [borg01x145:02419] [[52298,0],3] assigned IPv4 port 51079 > [borg01x145:02419] mca:oob:select: Adding component to end > [borg01x145:02419] mca:oob:select: Found 1 active transports > [borg01x144:30878] [[52298,0],2] ORTE_ERROR_LOG: Bad parameter in file > base/rml_base_contact.c at line 161 > [borg01x144:30878] [[52298,0],2] ORTE_ERROR_LOG: Bad parameter in file > routed_binomial.c at line 498 > [borg01x144:30878] [[52298,0],2] ORTE_ERROR_LOG: Bad parameter in file > base/ess_base_std_orted.c at line 539 > srun.slurm: error: borg01x143: task 0: Exited with exit code 213 > srun.slurm: Terminating job step 2332583.24 > slurmd[borg01x144]: *** STEP 2332583.24 KILLED AT 
2014-08-29T13:59:30 WITH > SIGNAL 9 *** > srun.slurm: Job step aborted: Waiting up to 2 seconds for job step to finish. > srun.slurm: error: borg01x153: task 3: Exited with exit code 213 > [borg01x153:01290] [[52298,0],4] ORTE_ERROR_LOG: Bad parameter in file > base/rml_base_contact.c at line 161 > [borg01x153:01290] [[52298,0],4] ORTE_ERROR_LOG: Bad parameter in file > routed_binomial.c at line 498 > [borg01x153:01290] [[52298,0],4] ORTE_ERROR_LOG: Bad parameter in file > base/ess_base_std_orted.c at line 539 > [borg01x143:13793] [[52298,0],1] ORTE_ERROR_LOG: Bad parameter in file > base/rml_base_contact.c at line 161 > [borg01x143:13793] [[52298,0],1] ORTE_ERROR_LOG: Bad parameter in file > routed_binomial.c at line 498 > [borg01x143:13793] [[52298,0],1] ORTE_ERROR_LOG: Bad parameter in file > base/ess_base_std_orted.c at line 539 > slurmd[borg01x144]: *** STEP 2332583.24 KILLED AT 2014-08-29T13:59:30 WITH > SIGNAL 9 *** > srun.slurm: error: borg01x144: task 1: Exited with exit code 213 > [borg01x154:01154] [[52298,0],5] ORTE_ERROR_LOG: Bad parameter in file > base/rml_base_contact.c at line 161 > [borg01x154:01154] [[52298,0],5] ORTE_ERROR_LOG: Bad parameter in file > routed_binomial.c at line 498 > [borg01x154:01154] [[52298,0],5] ORTE_ERROR_LOG: Bad parameter in file > base/ess_base_std_orted.c at line 539 > slurmd[borg01x154]: *** STEP 2332583.24 KILLED AT 2014-08-29T13:59:30 WITH > SIGNAL 9 *** > slurmd[borg01x154]: *** STEP 2332583.24 KILLED AT 2014-08-29T13:59:30 WITH > SIGNAL 9 *** > srun.slurm: error: borg01x154: task 4: Exited with exit code 213 > srun.slurm: error: borg01x145: task 2: Exited with exit code 213 > [borg01x145:02419] [[52298,0],3] ORTE_ERROR_LOG: Bad parameter in file > base/rml_base_contact.c at line 161 > [borg01x145:02419] [[52298,0],3] ORTE_ERROR_LOG: Bad parameter in file > routed_binomial.c at line 498 > [borg01x145:02419] [[52298,0],3] ORTE_ERROR_LOG: Bad parameter in file > base/ess_base_std_orted.c at line 539 > slurmd[borg01x145]: *** STEP 2332583.24 KILLED AT 2014-08-29T13:59:30 WITH > SIGNAL 9 *** > slurmd[borg01x145]: *** STEP 2332583.24 KILLED AT 2014-08-29T13:59:30 WITH > SIGNAL 9 *** > sh: tcp://10.1.25.142,172.31.1.254,10.12.25.142:41686: No such file or > directory > [borg01x142:29232] [[52298,0],0] TCP SHUTDOWN > [borg01x142:29232] mca: base: close: component tcp closed > [borg01x142:29232] mca: base: close: unloading component tcp > > Note, if I can get the allocation today, I want to try doing all this on a > single SandyBridge node, rather than on 6. It might make comparing various > runs a bit easier! > > Matt > > > > On Fri, Aug 29, 2014 at 12:42 PM, Ralph Castain <r...@open-mpi.org> wrote: > Okay, something quite weird is happening here. I can't replicate using the > 1.8.2 release tarball on a slurm machine, so my guess is that something else > is going on here. > > Could you please rebuild the 1.8.2 code with --enable-debug on the configure > line (assuming you haven't already done so), and then rerun that version as > before but adding "--mca oob_base_verbose 10" to the cmd line? 
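(For reference, that rebuild amounts to adding --enable-debug to the existing configure line and installing to a separate prefix; the prefix below matches the -debug install used in the run quoted at the top. A minimal sketch, assuming the same build script quoted further down, with the compiler and *FLAGS settings carried over unchanged:

    export PREFIX=/discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug
    # *FLAGS arguments omitted for brevity; keep them as in the original script.
    # Log file names here are illustrative.
    ./configure --enable-debug --with-slurm --disable-wrapper-rpath --enable-shared \
        --enable-mca-no-build=btl-usnic \
        CC=gcc CXX=g++ F77=gfortran FC=gfortran \
        LDFLAGS="-L/usr/nlocal/slurm/2.6.3/lib64" CPPFLAGS="-I/usr/nlocal/slurm/2.6.3/include" \
        LIBS="-lpciaccess" \
        --prefix=${PREFIX} 2>&1 | tee configure.1.8.2-debug.log
    make 2>&1 | tee make.1.8.2-debug.log
    make install 2>&1 | tee makeinstall.1.8.2-debug.log

followed by rerunning mpirun as before with "--mca oob_base_verbose 10" added.)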
> > > On Aug 29, 2014, at 4:22 AM, Matt Thompson <fort...@gmail.com> wrote: > >> Ralph, >> >> For 1.8.2rc4 I get: >> >> (1003) $ >> /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2rc4/bin/mpirun >> --leave-session-attached --debug-daemons -np 8 ./helloWorld.182.x >> srun.slurm: cluster configuration lacks support for cpu binding >> srun.slurm: cluster configuration lacks support for cpu binding >> Daemon [[47143,0],5] checking in as pid 10990 on host borg01x154 >> [borg01x154:10990] [[47143,0],5] orted: up and running - waiting for >> commands! >> Daemon [[47143,0],1] checking in as pid 23473 on host borg01x143 >> Daemon [[47143,0],2] checking in as pid 8250 on host borg01x144 >> [borg01x144:08250] [[47143,0],2] orted: up and running - waiting for >> commands! >> [borg01x143:23473] [[47143,0],1] orted: up and running - waiting for >> commands! >> Daemon [[47143,0],3] checking in as pid 12320 on host borg01x145 >> Daemon [[47143,0],4] checking in as pid 10902 on host borg01x153 >> [borg01x153:10902] [[47143,0],4] orted: up and running - waiting for >> commands! >> [borg01x145:12320] [[47143,0],3] orted: up and running - waiting for >> commands! >> [borg01x142:01629] [[47143,0],0] orted_cmd: received add_local_procs >> [borg01x144:08250] [[47143,0],2] orted_cmd: received add_local_procs >> [borg01x153:10902] [[47143,0],4] orted_cmd: received add_local_procs >> [borg01x143:23473] [[47143,0],1] orted_cmd: received add_local_procs >> [borg01x145:12320] [[47143,0],3] orted_cmd: received add_local_procs >> [borg01x154:10990] [[47143,0],5] orted_cmd: received add_local_procs >> [borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap from local >> proc [[47143,1],0] >> [borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap from local >> proc [[47143,1],2] >> [borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap from local >> proc [[47143,1],3] >> [borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap from local >> proc [[47143,1],1] >> [borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap from local >> proc [[47143,1],5] >> [borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap from local >> proc [[47143,1],4] >> [borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap from local >> proc [[47143,1],6] >> [borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap from local >> proc [[47143,1],7] >> MPIR_being_debugged = 0 >> MPIR_debug_state = 1 >> MPIR_partial_attach_ok = 1 >> MPIR_i_am_starter = 0 >> MPIR_forward_output = 0 >> MPIR_proctable_size = 8 >> MPIR_proctable: >> (i, host, exe, pid) = (0, borg01x142, >> /home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1647) >> (i, host, exe, pid) = (1, borg01x142, >> /home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1648) >> (i, host, exe, pid) = (2, borg01x142, >> /home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1650) >> (i, host, exe, pid) = (3, borg01x142, >> /home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1652) >> (i, host, exe, pid) = (4, borg01x142, >> /home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1654) >> (i, host, exe, pid) = (5, borg01x142, >> /home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1656) >> (i, host, exe, pid) = (6, borg01x142, >> /home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1658) >> (i, host, exe, pid) = (7, borg01x142, >> /home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1660) >> MPIR_executable_path: NULL >> MPIR_server_arguments: NULL >> [borg01x142:01629] [[47143,0],0] orted_cmd: received message_local_procs >> [borg01x144:08250] 
[[47143,0],2] orted_cmd: received message_local_procs >> [borg01x143:23473] [[47143,0],1] orted_cmd: received message_local_procs >> [borg01x153:10902] [[47143,0],4] orted_cmd: received message_local_procs >> [borg01x154:10990] [[47143,0],5] orted_cmd: received message_local_procs >> [borg01x145:12320] [[47143,0],3] orted_cmd: received message_local_procs >> [borg01x142:01629] [[47143,0],0] orted_cmd: received message_local_procs >> [borg01x143:23473] [[47143,0],1] orted_cmd: received message_local_procs >> [borg01x144:08250] [[47143,0],2] orted_cmd: received message_local_procs >> [borg01x153:10902] [[47143,0],4] orted_cmd: received message_local_procs >> [borg01x145:12320] [[47143,0],3] orted_cmd: received message_local_procs >> Process 2 of 8 is on borg01x142 >> Process 5 of 8 is on borg01x142 >> Process 4 of 8 is on borg01x142 >> Process 1 of 8 is on borg01x142 >> Process 0 of 8 is on borg01x142 >> Process 3 of 8 is on borg01x142 >> Process 6 of 8 is on borg01x142 >> Process 7 of 8 is on borg01x142 >> [borg01x154:10990] [[47143,0],5] orted_cmd: received message_local_procs >> [borg01x142:01629] [[47143,0],0] orted_cmd: received message_local_procs >> [borg01x144:08250] [[47143,0],2] orted_cmd: received message_local_procs >> [borg01x143:23473] [[47143,0],1] orted_cmd: received message_local_procs >> [borg01x153:10902] [[47143,0],4] orted_cmd: received message_local_procs >> [borg01x154:10990] [[47143,0],5] orted_cmd: received message_local_procs >> [borg01x145:12320] [[47143,0],3] orted_cmd: received message_local_procs >> [borg01x142:01629] [[47143,0],0] orted_recv: received sync from local proc >> [[47143,1],2] >> [borg01x142:01629] [[47143,0],0] orted_recv: received sync from local proc >> [[47143,1],1] >> [borg01x142:01629] [[47143,0],0] orted_recv: received sync from local proc >> [[47143,1],3] >> [borg01x142:01629] [[47143,0],0] orted_recv: received sync from local proc >> [[47143,1],0] >> [borg01x142:01629] [[47143,0],0] orted_recv: received sync from local proc >> [[47143,1],4] >> [borg01x142:01629] [[47143,0],0] orted_recv: received sync from local proc >> [[47143,1],6] >> [borg01x142:01629] [[47143,0],0] orted_recv: received sync from local proc >> [[47143,1],5] >> [borg01x142:01629] [[47143,0],0] orted_recv: received sync from local proc >> [[47143,1],7] >> [borg01x142:01629] [[47143,0],0] orted_cmd: received exit cmd >> [borg01x144:08250] [[47143,0],2] orted_cmd: received exit cmd >> [borg01x144:08250] [[47143,0],2] orted_cmd: all routes and children gone - >> exiting >> [borg01x153:10902] [[47143,0],4] orted_cmd: received exit cmd >> [borg01x153:10902] [[47143,0],4] orted_cmd: all routes and children gone - >> exiting >> [borg01x143:23473] [[47143,0],1] orted_cmd: received exit cmd >> [borg01x154:10990] [[47143,0],5] orted_cmd: received exit cmd >> [borg01x154:10990] [[47143,0],5] orted_cmd: all routes and children gone - >> exiting >> [borg01x145:12320] [[47143,0],3] orted_cmd: received exit cmd >> [borg01x145:12320] [[47143,0],3] orted_cmd: all routes and children gone - >> exiting >> >> Using the 1.8.2 mpirun: >> >> (1004) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2/bin/mpirun >> --leave-session-attached --debug-daemons -np 8 ./helloWorld.182.x >> srun.slurm: cluster configuration lacks support for cpu binding >> srun.slurm: cluster configuration lacks support for cpu binding >> [borg01x143:23494] [[47330,0],1] ORTE_ERROR_LOG: Bad parameter in file >> base/rml_base_contact.c at line 161 >> [borg01x143:23494] [[47330,0],1] ORTE_ERROR_LOG: Bad parameter in 
file >> routed_binomial.c at line 498 >> [borg01x143:23494] [[47330,0],1] ORTE_ERROR_LOG: Bad parameter in file >> base/ess_base_std_orted.c at line 539 >> srun.slurm: error: borg01x143: task 0: Exited with exit code 213 >> srun.slurm: Terminating job step 2332583.4 >> [borg01x153:10915] [[47330,0],4] ORTE_ERROR_LOG: Bad parameter in file >> base/rml_base_contact.c at line 161 >> [borg01x153:10915] [[47330,0],4] ORTE_ERROR_LOG: Bad parameter in file >> routed_binomial.c at line 498 >> [borg01x153:10915] [[47330,0],4] ORTE_ERROR_LOG: Bad parameter in file >> base/ess_base_std_orted.c at line 539 >> [borg01x144:08263] [[47330,0],2] ORTE_ERROR_LOG: Bad parameter in file >> base/rml_base_contact.c at line 161 >> [borg01x144:08263] [[47330,0],2] ORTE_ERROR_LOG: Bad parameter in file >> routed_binomial.c at line 498 >> [borg01x144:08263] [[47330,0],2] ORTE_ERROR_LOG: Bad parameter in file >> base/ess_base_std_orted.c at line 539 >> srun.slurm: Job step aborted: Waiting up to 2 seconds for job step to finish. >> slurmd[borg01x145]: *** STEP 2332583.4 KILLED AT 2014-08-29T07:16:20 WITH >> SIGNAL 9 *** >> slurmd[borg01x154]: *** STEP 2332583.4 KILLED AT 2014-08-29T07:16:20 WITH >> SIGNAL 9 *** >> slurmd[borg01x153]: *** STEP 2332583.4 KILLED AT 2014-08-29T07:16:20 WITH >> SIGNAL 9 *** >> slurmd[borg01x153]: *** STEP 2332583.4 KILLED AT 2014-08-29T07:16:20 WITH >> SIGNAL 9 *** >> srun.slurm: error: borg01x144: task 1: Exited with exit code 213 >> slurmd[borg01x144]: *** STEP 2332583.4 KILLED AT 2014-08-29T07:16:20 WITH >> SIGNAL 9 *** >> slurmd[borg01x144]: *** STEP 2332583.4 KILLED AT 2014-08-29T07:16:20 WITH >> SIGNAL 9 *** >> srun.slurm: error: borg01x153: task 3: Exited with exit code 213 >> slurmd[borg01x154]: *** STEP 2332583.4 KILLED AT 2014-08-29T07:16:20 WITH >> SIGNAL 9 *** >> slurmd[borg01x145]: *** STEP 2332583.4 KILLED AT 2014-08-29T07:16:20 WITH >> SIGNAL 9 *** >> srun.slurm: error: borg01x154: task 4: Killed >> srun.slurm: error: borg01x145: task 2: Killed >> sh: tcp://10.1.25.142,172.31.1.254,10.12.25.142:34169: No such file or >> directory >> >> >> >> >> On Thu, Aug 28, 2014 at 7:17 PM, Ralph Castain <r...@open-mpi.org> wrote: >> I'm unaware of any changes to the Slurm integration between rc4 and final >> release. It sounds like this might be something else going on - try adding >> "--leave-session-attached --debug-daemons" to your 1.8.2 command line and >> let's see if any errors get reported. >> >> >> On Aug 28, 2014, at 12:20 PM, Matt Thompson <fort...@gmail.com> wrote: >> >>> Open MPI List, >>> >>> I recently encountered an odd bug with Open MPI 1.8.1 and GCC 4.9.1 on our >>> cluster (reported on this list), and decided to try it with 1.8.2. However, >>> we seem to be having an issue with Open MPI 1.8.2 and SLURM. Even weirder, >>> Open MPI 1.8.2rc4 doesn't show the bug. And the bug is: I get no stdout >>> with Open MPI 1.8.2. That is, HelloWorld doesn't work. 
>>>
>>> To wit, our sysadmin has two tarballs:
>>>
>>> (1441) $ sha1sum openmpi-1.8.2rc4.tar.bz2
>>> 7e7496913c949451f546f22a1a159df25f8bb683 openmpi-1.8.2rc4.tar.bz2
>>> (1442) $ sha1sum openmpi-1.8.2.tar.gz
>>> cf2b1e45575896f63367406c6c50574699d8b2e1 openmpi-1.8.2.tar.gz
>>>
>>> I then build each with a script in the method our sysadmin usually does:
>>>
>>> #!/bin/sh
>>> set -x
>>> export PREFIX=/discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2
>>> export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/nlocal/slurm/2.6.3/lib64
>>> build() {
>>>    echo `pwd`
>>>    ./configure --with-slurm --disable-wrapper-rpath --enable-shared --enable-mca-no-build=btl-usnic \
>>>        CC=gcc CXX=g++ F77=gfortran FC=gfortran \
>>>        CFLAGS="-mtune=generic -fPIC -m64" CXXFLAGS="-mtune=generic -fPIC -m64" FFLAGS="-mtune=generic -fPIC -m64" \
>>>        F77FLAGS="-mtune=generic -fPIC -m64" FCFLAGS="-mtune=generic -fPIC -m64" F90FLAGS="-mtune=generic -fPIC -m64" \
>>>        LDFLAGS="-L/usr/nlocal/slurm/2.6.3/lib64" CPPFLAGS="-I/usr/nlocal/slurm/2.6.3/include" LIBS="-lpciaccess" \
>>>        --prefix=${PREFIX} 2>&1 | tee configure.1.8.2.log
>>>    make 2>&1 | tee make.1.8.2.log
>>>    make check 2>&1 | tee makecheck.1.8.2.log
>>>    make install 2>&1 | tee makeinstall.1.8.2.log
>>> }
>>> echo "calling build"
>>> build
>>> echo "exiting"
>>>
>>> The only difference between the two is '1.8.2' or '1.8.2rc4' in the PREFIX
>>> and log file tees. Now, let us test. First, I grab some nodes with slurm:
>>>
>>> $ salloc --nodes=6 --ntasks-per-node=16 --constraint=sand --time=09:00:00 --account=g0620 --mail-type=BEGIN
>>>
>>> Once I get my nodes, I run with 1.8.2rc4:
>>>
>>> (1142) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2rc4/bin/mpifort -o helloWorld.182rc4.x helloWorld.F90
>>> (1143) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2rc4/bin/mpirun -np 8 ./helloWorld.182rc4.x
>>> Process 0 of 8 is on borg01w044
>>> Process 5 of 8 is on borg01w044
>>> Process 3 of 8 is on borg01w044
>>> Process 7 of 8 is on borg01w044
>>> Process 1 of 8 is on borg01w044
>>> Process 2 of 8 is on borg01w044
>>> Process 4 of 8 is on borg01w044
>>> Process 6 of 8 is on borg01w044
>>>
>>> Now 1.8.2:
>>>
>>> (1144) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2/bin/mpifort -o helloWorld.182.x helloWorld.F90
>>> (1145) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2/bin/mpirun -np 8 ./helloWorld.182.x
>>> (1146) $
>>>
>>> No output at all. But, if I take the helloWorld.x from 1.8.2 and run it
>>> with 1.8.2rc4's mpirun:
>>>
>>> (1146) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2rc4/bin/mpirun -np 8 ./helloWorld.182.x
>>> Process 5 of 8 is on borg01w044
>>> Process 7 of 8 is on borg01w044
>>> Process 2 of 8 is on borg01w044
>>> Process 4 of 8 is on borg01w044
>>> Process 1 of 8 is on borg01w044
>>> Process 3 of 8 is on borg01w044
>>> Process 6 of 8 is on borg01w044
>>> Process 0 of 8 is on borg01w044
>>>
>>> So...any idea what is happening here? There did seem to be a few SLURM
>>> related changes between the two tarballs involving /dev/null but it's a bit
>>> above me to decipher.
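(One way to eyeball those SLURM-related changes, assuming both tarballs are unpacked side by side, is to diff the SLURM components of the run-time layer directly, e.g.:

    # standard component locations inside the Open MPI 1.8 source tree
    diff -ru openmpi-1.8.2rc4/orte/mca/plm/slurm openmpi-1.8.2/orte/mca/plm/slurm
    diff -ru openmpi-1.8.2rc4/orte/mca/ess/slurm openmpi-1.8.2/orte/mca/ess/slurm

The plm/slurm component is the code that builds the srun command used to launch the remote orted daemons.)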
>>>
>>> You can find the ompi_info, build, make, config, etc logs at these links
>>> (they are ~300kB which is over the mailing list limit according to the Open
>>> MPI web page):
>>>
>>> https://dl.dropboxusercontent.com/u/61696/OMPI-1.8.2rc4-Output.tar.bz2
>>> https://dl.dropboxusercontent.com/u/61696/OMPI-1.8.2-Output.tar.bz2
>>>
>>> Thank you for any help and please let me know if you need more information,
>>> Matt
>>>
>>> --
>>> "And, isn't sanity really just a one-trick pony anyway? I mean all you
>>> get is one trick: rational thinking. But when you're good and crazy,
>>> oooh, oooh, oooh, the sky is the limit!" -- The Tick