Hmmm... I may see the problem. Would you be so kind as to apply the attached patch to your 1.8.2 code, rebuild, and try again?
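
In case it helps to explain what I'm chasing: in your log, the srun command line prints the HNP URI as "-mca orte_hnp_uri 3221946368.0;tcp://10.1.24.63,172.31.1.254,10.12.24.63:41373", and further down there is "sh: tcp://10.1.24.63,172.31.1.254,10.12.24.63:41373: No such file or directory". That looks like a shell splitting the URI argument at the semicolon. A toy illustration with made-up values (the exact error wording depends on which shell /bin/sh is):

$ sh -c "echo hello;tcp://10.1.24.63:41373"
hello
sh: tcp://10.1.24.63:41373: No such file or directory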

I much appreciate the help. Everyone's system is slightly different, and I think you've uncovered one of those differences.
Ralph

Attachment: uri.diff


On Aug 31, 2014, at 6:25 AM, Matt Thompson <fort...@gmail.com> wrote:

Ralph,

Sorry it took me a bit of time. Here you go:

(1002) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug/bin/mpirun --leave-session-attached --debug-daemons --mca oob_base_verbose 10 -mca plm_base_verbose 5 -np 8 ./helloWorld.182-debug.x
[borg01w063:03815] mca:base:select:(  plm) Querying component [isolated]
[borg01w063:03815] mca:base:select:(  plm) Query of component [isolated] set priority to 0
[borg01w063:03815] mca:base:select:(  plm) Querying component [rsh]
[borg01w063:03815] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[borg01w063:03815] mca:base:select:(  plm) Query of component [rsh] set priority to 10
[borg01w063:03815] mca:base:select:(  plm) Querying component [slurm]
[borg01w063:03815] [[INVALID],INVALID] plm:slurm: available for selection
[borg01w063:03815] mca:base:select:(  plm) Query of component [slurm] set priority to 75
[borg01w063:03815] mca:base:select:(  plm) Selected component [slurm]
[borg01w063:03815] plm:base:set_hnp_name: initial bias 3815 nodename hash 1757783593
[borg01w063:03815] plm:base:set_hnp_name: final jobfam 49163
[borg01w063:03815] mca: base: components_register: registering oob components
[borg01w063:03815] mca: base: components_register: found loaded component tcp
[borg01w063:03815] mca: base: components_register: component tcp register function successful
[borg01w063:03815] mca: base: components_open: opening oob components
[borg01w063:03815] mca: base: components_open: found loaded component tcp
[borg01w063:03815] mca: base: components_open: component tcp open function successful
[borg01w063:03815] mca:oob:select: checking available component tcp
[borg01w063:03815] mca:oob:select: Querying component [tcp]
[borg01w063:03815] oob:tcp: component_available called
[borg01w063:03815] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[borg01w063:03815] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4
[borg01w063:03815] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4
[borg01w063:03815] [[49163,0],0] oob:tcp:init adding 10.1.24.63 to our list of V4 connections
[borg01w063:03815] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4
[borg01w063:03815] [[49163,0],0] oob:tcp:init adding 172.31.1.254 to our list of V4 connections
[borg01w063:03815] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4
[borg01w063:03815] [[49163,0],0] oob:tcp:init adding 10.12.24.63 to our list of V4 connections
[borg01w063:03815] [[49163,0],0] TCP STARTUP
[borg01w063:03815] [[49163,0],0] attempting to bind to IPv4 port 0
[borg01w063:03815] [[49163,0],0] assigned IPv4 port 41373
[borg01w063:03815] mca:oob:select: Adding component to end
[borg01w063:03815] mca:oob:select: Found 1 active transports
[borg01w063:03815] [[49163,0],0] plm:base:receive start comm
[borg01w063:03815] [[49163,0],0] plm:base:setup_job
[borg01w063:03815] [[49163,0],0] plm:slurm: LAUNCH DAEMONS CALLED
[borg01w063:03815] [[49163,0],0] plm:base:setup_vm
[borg01w063:03815] [[49163,0],0] plm:base:setup_vm creating map
[borg01w063:03815] [[49163,0],0] plm:base:setup_vm add new daemon [[49163,0],1]
[borg01w063:03815] [[49163,0],0] plm:base:setup_vm assigning new daemon [[49163,0],1] to node borg01w064
[borg01w063:03815] [[49163,0],0] plm:base:setup_vm add new daemon [[49163,0],2]
[borg01w063:03815] [[49163,0],0] plm:base:setup_vm assigning new daemon [[49163,0],2] to node borg01w065
[borg01w063:03815] [[49163,0],0] plm:base:setup_vm add new daemon [[49163,0],3]
[borg01w063:03815] [[49163,0],0] plm:base:setup_vm assigning new daemon [[49163,0],3] to node borg01w069
[borg01w063:03815] [[49163,0],0] plm:base:setup_vm add new daemon [[49163,0],4]
[borg01w063:03815] [[49163,0],0] plm:base:setup_vm assigning new daemon [[49163,0],4] to node borg01w070
[borg01w063:03815] [[49163,0],0] plm:base:setup_vm add new daemon [[49163,0],5]
[borg01w063:03815] [[49163,0],0] plm:base:setup_vm assigning new daemon [[49163,0],5] to node borg01w071
[borg01w063:03815] [[49163,0],0] plm:slurm: launching on nodes borg01w064,borg01w065,borg01w069,borg01w070,borg01w071
[borg01w063:03815] [[49163,0],0] plm:slurm: Set prefix:/discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug
[borg01w063:03815] [[49163,0],0] plm:slurm: final top-level argv:
srun --ntasks-per-node=1 --kill-on-bad-exit --cpu_bind=none --nodes=5 --nodelist=borg01w064,borg01w065,borg01w069,borg01w070,borg01w071 --ntasks=5 orted -mca orte_debug_daemons 1 -mca orte_leave_session_attached 1 -mca orte_ess_jobid 3221946368 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 6 -mca orte_hnp_uri 3221946368.0;tcp://10.1.24.63,172.31.1.254,10.12.24.63:41373 --mca oob_base_verbose 10 -mca plm_base_verbose 5
[borg01w063:03815] [[49163,0],0] plm:slurm: reset PATH: /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug/bin:/usr/local/other/SLES11/gcc/4.9.1/bin:/usr/local/other/SLES11.1/git/1.8.5.2/libexec/git-core:/usr/local/other/SLES11.1/git/1.8.5.2/bin:/usr/local/other/SLES11/svn/1.6.17/bin:/usr/local/other/SLES11/tkcvs/8.2.3/gcc-4.3.2/bin:.:/home/mathomp4/bin:/home/mathomp4/cvstools:/discover/nobackup/projects/gmao/share/dasilva/opengrads/Contents:/usr/local/other/Htop/1.0/bin:/usr/local/other/SLES11/gnuplot/4.6.0/gcc-4.3.2/bin:/usr/local/other/SLES11/xpdf/3.03-gcc-4.3.2/bin:/home/mathomp4/src/pdtoolkit-3.16/x86_64/bin:/discover/nobackup/mathomp4/WavewatchIII-GMAO/bin:/discover/nobackup/mathomp4/WavewatchIII-GMAO/exe:/usr/local/other/pods:/usr/local/other/SLES11.1/R/3.1.0/gcc-4.3.4/lib64/R/bin:.:/home/mathomp4/bin:/home/mathomp4/cvstools:/discover/nobackup/projects/gmao/share/dasilva/opengrads/Contents:/usr/local/other/Htop/1.0/bin:/usr/local/other/SLES11/gnuplot/4.6.0/gcc-4.3.2/bin:/usr/local/other/SLES11/xpdf/3.03-gcc-4.3.2/bin:/home/mathomp4/src/pdtoolkit-3.16/x86_64/bin:/discover/nobackup/mathomp4/WavewatchIII-GMAO/bin:/discover/nobackup/mathomp4/WavewatchIII-GMAO/exe:/usr/local/other/pods:/usr/local/other/SLES11.1/R/3.1.0/gcc-4.3.4/lib64/R/bin:/home/mathomp4/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/bin/X11:/usr/X11R6/bin:/usr/games:/opt/kde3/bin:/usr/lib/mit/bin:/usr/lib/mit/sbin:/usr/slurm/bin
[borg01w063:03815] [[49163,0],0] plm:slurm: reset LD_LIBRARY_PATH: /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug/lib:/usr/local/other/SLES11/gcc/4.9.1/lib64:/usr/local/other/SLES11.1/git/1.8.5.2/lib:/usr/local/other/SLES11/svn/1.6.17/lib:/usr/local/other/SLES11/tkcvs/8.2.3/gcc-4.3.2/lib
srun.slurm: cluster configuration lacks support for cpu binding
srun.slurm: cluster configuration lacks support for cpu binding
[borg01w065:15893] mca: base: components_register: registering oob components
[borg01w065:15893] mca: base: components_register: found loaded component tcp
[borg01w065:15893] mca: base: components_register: component tcp register function successful
[borg01w065:15893] mca: base: components_open: opening oob components
[borg01w065:15893] mca: base: components_open: found loaded component tcp
[borg01w065:15893] mca: base: components_open: component tcp open function successful
[borg01w065:15893] mca:oob:select: checking available component tcp
[borg01w065:15893] mca:oob:select: Querying component [tcp]
[borg01w065:15893] oob:tcp: component_available called
[borg01w065:15893] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[borg01w065:15893] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4
[borg01w065:15893] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4
[borg01w065:15893] [[49163,0],2] oob:tcp:init adding 10.1.24.65 to our list of V4 connections
[borg01w065:15893] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4
[borg01w065:15893] [[49163,0],2] oob:tcp:init adding 172.31.1.254 to our list of V4 connections
[borg01w065:15893] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4
[borg01w065:15893] [[49163,0],2] oob:tcp:init adding 10.12.24.65 to our list of V4 connections
[borg01w065:15893] [[49163,0],2] TCP STARTUP
[borg01w065:15893] [[49163,0],2] attempting to bind to IPv4 port 0
[borg01w065:15893] [[49163,0],2] assigned IPv4 port 43456
[borg01w065:15893] mca:oob:select: Adding component to end
[borg01w065:15893] mca:oob:select: Found 1 active transports
[borg01w070:12645] mca: base: components_register: registering oob components
[borg01w070:12645] mca: base: components_register: found loaded component tcp
[borg01w070:12645] mca: base: components_register: component tcp register function successful
[borg01w070:12645] mca: base: components_open: opening oob components
[borg01w070:12645] mca: base: components_open: found loaded component tcp
[borg01w070:12645] mca: base: components_open: component tcp open function successful
[borg01w070:12645] mca:oob:select: checking available component tcp
[borg01w070:12645] mca:oob:select: Querying component [tcp]
[borg01w070:12645] oob:tcp: component_available called
[borg01w070:12645] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[borg01w070:12645] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4
[borg01w070:12645] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4
[borg01w070:12645] [[49163,0],4] oob:tcp:init adding 10.1.24.70 to our list of V4 connections
[borg01w070:12645] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4
[borg01w070:12645] [[49163,0],4] oob:tcp:init adding 172.31.1.254 to our list of V4 connections
[borg01w070:12645] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4
[borg01w070:12645] [[49163,0],4] oob:tcp:init adding 10.12.24.70 to our list of V4 connections
[borg01w070:12645] [[49163,0],4] TCP STARTUP
[borg01w070:12645] [[49163,0],4] attempting to bind to IPv4 port 0
[borg01w070:12645] [[49163,0],4] assigned IPv4 port 53062
[borg01w070:12645] mca:oob:select: Adding component to end
[borg01w070:12645] mca:oob:select: Found 1 active transports
[borg01w064:16565] mca: base: components_register: registering oob components
[borg01w064:16565] mca: base: components_register: found loaded component tcp
[borg01w064:16565] mca: base: components_register: component tcp register function successful
[borg01w071:14879] mca: base: components_register: registering oob components
[borg01w071:14879] mca: base: components_register: found loaded component tcp
[borg01w064:16565] mca: base: components_open: opening oob components
[borg01w064:16565] mca: base: components_open: found loaded component tcp
[borg01w064:16565] mca: base: components_open: component tcp open function successful
[borg01w064:16565] mca:oob:select: checking available component tcp
[borg01w064:16565] mca:oob:select: Querying component [tcp]
[borg01w064:16565] oob:tcp: component_available called
[borg01w064:16565] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[borg01w064:16565] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4
[borg01w064:16565] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4
[borg01w064:16565] [[49163,0],1] oob:tcp:init adding 10.1.24.64 to our list of V4 connections
[borg01w064:16565] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4
[borg01w064:16565] [[49163,0],1] oob:tcp:init adding 172.31.1.254 to our list of V4 connections
[borg01w064:16565] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4
[borg01w064:16565] [[49163,0],1] oob:tcp:init adding 10.12.24.64 to our list of V4 connections
[borg01w064:16565] [[49163,0],1] TCP STARTUP
[borg01w064:16565] [[49163,0],1] attempting to bind to IPv4 port 0
[borg01w064:16565] [[49163,0],1] assigned IPv4 port 43828
[borg01w064:16565] mca:oob:select: Adding component to end
[borg01w069:30276] mca: base: components_register: registering oob components
[borg01w069:30276] mca: base: components_register: found loaded component tcp
[borg01w071:14879] mca: base: components_register: component tcp register function successful
[borg01w069:30276] mca: base: components_register: component tcp register function successful
[borg01w071:14879] mca: base: components_open: opening oob components
[borg01w071:14879] mca: base: components_open: found loaded component tcp
[borg01w071:14879] mca: base: components_open: component tcp open function successful
[borg01w071:14879] mca:oob:select: checking available component tcp
[borg01w071:14879] mca:oob:select: Querying component [tcp]
[borg01w071:14879] oob:tcp: component_available called
[borg01w069:30276] mca: base: components_open: opening oob components
[borg01w069:30276] mca: base: components_open: found loaded component tcp
[borg01w069:30276] mca: base: components_open: component tcp open function successful
[borg01w071:14879] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[borg01w071:14879] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4
[borg01w071:14879] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4
[borg01w071:14879] [[49163,0],5] oob:tcp:init adding 10.1.24.71 to our list of V4 connections
[borg01w071:14879] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4
[borg01w069:30276] mca:oob:select: checking available component tcp
[borg01w069:30276] mca:oob:select: Querying component [tcp]
[borg01w069:30276] oob:tcp: component_available called
[borg01w071:14879] [[49163,0],5] oob:tcp:init adding 172.31.1.254 to our list of V4 connections
[borg01w071:14879] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4
[borg01w071:14879] [[49163,0],5] oob:tcp:init adding 10.12.24.71 to our list of V4 connections
[borg01w071:14879] [[49163,0],5] TCP STARTUP
[borg01w069:30276] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[borg01w069:30276] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4
[borg01w069:30276] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4
[borg01w069:30276] [[49163,0],3] oob:tcp:init adding 10.1.24.69 to our list of V4 connections
[borg01w069:30276] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4
[borg01w069:30276] [[49163,0],3] oob:tcp:init adding 172.31.1.254 to our list of V4 connections
[borg01w069:30276] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4
[borg01w069:30276] [[49163,0],3] oob:tcp:init adding 10.12.24.69 to our list of V4 connections
[borg01w069:30276] [[49163,0],3] TCP STARTUP
[borg01w071:14879] [[49163,0],5] attempting to bind to IPv4 port 0
[borg01w069:30276] [[49163,0],3] attempting to bind to IPv4 port 0
[borg01w069:30276] [[49163,0],3] assigned IPv4 port 39299
[borg01w064:16565] mca:oob:select: Found 1 active transports
[borg01w069:30276] mca:oob:select: Adding component to end
[borg01w069:30276] mca:oob:select: Found 1 active transports
[borg01w071:14879] [[49163,0],5] assigned IPv4 port 56113
[borg01w071:14879] mca:oob:select: Adding component to end
[borg01w071:14879] mca:oob:select: Found 1 active transports
srun.slurm: error: borg01w064: task 0: Exited with exit code 213
srun.slurm: Terminating job step 2347743.3
srun.slurm: Job step aborted: Waiting up to 2 seconds for job step to finish.
[borg01w070:12645] [[49163,0],4] ORTE_ERROR_LOG: Bad parameter in file base/rml_base_contact.c at line 161
[borg01w070:12645] [[49163,0],4] ORTE_ERROR_LOG: Bad parameter in file routed_binomial.c at line 498
[borg01w070:12645] [[49163,0],4] ORTE_ERROR_LOG: Bad parameter in file base/ess_base_std_orted.c at line 539
[borg01w065:15893] [[49163,0],2] ORTE_ERROR_LOG: Bad parameter in file base/rml_base_contact.c at line 161
[borg01w065:15893] [[49163,0],2] ORTE_ERROR_LOG: Bad parameter in file routed_binomial.c at line 498
[borg01w065:15893] [[49163,0],2] ORTE_ERROR_LOG: Bad parameter in file base/ess_base_std_orted.c at line 539
slurmd[borg01w065]: *** STEP 2347743.3 KILLED AT 2014-08-31T09:24:17 WITH SIGNAL 9 ***
slurmd[borg01w070]: *** STEP 2347743.3 KILLED AT 2014-08-31T09:24:17 WITH SIGNAL 9 ***
[borg01w064:16565] [[49163,0],1] ORTE_ERROR_LOG: Bad parameter in file base/rml_base_contact.c at line 161
[borg01w064:16565] [[49163,0],1] ORTE_ERROR_LOG: Bad parameter in file routed_binomial.c at line 498
[borg01w064:16565] [[49163,0],1] ORTE_ERROR_LOG: Bad parameter in file base/ess_base_std_orted.c at line 539
[borg01w069:30276] [[49163,0],3] ORTE_ERROR_LOG: Bad parameter in file base/rml_base_contact.c at line 161
[borg01w069:30276] [[49163,0],3] ORTE_ERROR_LOG: Bad parameter in file routed_binomial.c at line 498
[borg01w069:30276] [[49163,0],3] ORTE_ERROR_LOG: Bad parameter in file base/ess_base_std_orted.c at line 539
slurmd[borg01w069]: *** STEP 2347743.3 KILLED AT 2014-08-31T09:24:17 WITH SIGNAL 9 ***
[borg01w071:14879] [[49163,0],5] ORTE_ERROR_LOG: Bad parameter in file base/rml_base_contact.c at line 161
[borg01w071:14879] [[49163,0],5] ORTE_ERROR_LOG: Bad parameter in file routed_binomial.c at line 498
[borg01w071:14879] [[49163,0],5] ORTE_ERROR_LOG: Bad parameter in file base/ess_base_std_orted.c at line 539
slurmd[borg01w071]: *** STEP 2347743.3 KILLED AT 2014-08-31T09:24:17 WITH SIGNAL 9 ***
slurmd[borg01w065]: *** STEP 2347743.3 KILLED AT 2014-08-31T09:24:17 WITH SIGNAL 9 ***
slurmd[borg01w069]: *** STEP 2347743.3 KILLED AT 2014-08-31T09:24:17 WITH SIGNAL 9 ***
slurmd[borg01w070]: *** STEP 2347743.3 KILLED AT 2014-08-31T09:24:17 WITH SIGNAL 9 ***
slurmd[borg01w071]: *** STEP 2347743.3 KILLED AT 2014-08-31T09:24:17 WITH SIGNAL 9 ***
srun.slurm: error: borg01w069: task 2: Exited with exit code 213
srun.slurm: error: borg01w065: task 1: Exited with exit code 213
srun.slurm: error: borg01w071: task 4: Exited with exit code 213
srun.slurm: error: borg01w070: task 3: Exited with exit code 213
sh: tcp://10.1.24.63,172.31.1.254,10.12.24.63:41373: No such file or directory
[borg01w063:03815] [[49163,0],0] plm:slurm: primary daemons complete!
[borg01w063:03815] [[49163,0],0] plm:base:receive stop comm
[borg01w063:03815] [[49163,0],0] TCP SHUTDOWN
[borg01w063:03815] mca: base: close: component tcp closed
[borg01w063:03815] mca: base: close: unloading component tcp



On Fri, Aug 29, 2014 at 3:18 PM, Ralph Castain <r...@open-mpi.org> wrote:
Rats - I also need "-mca plm_base_verbose 5" on there so I can see the cmd line being executed. Can you add it?
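
In other words, the full set of flags would be something like (using the same executable as your last run):

mpirun --leave-session-attached --debug-daemons --mca oob_base_verbose 10 -mca plm_base_verbose 5 -np 8 ./helloWorld.182-debug.x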


On Aug 29, 2014, at 11:16 AM, Matt Thompson <fort...@gmail.com> wrote:

Ralph,

Here you go:

(1080) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug/bin/mpirun --leave-session-attached --debug-daemons --mca oob_base_verbose 10 -np 8 ./helloWorld.182-debug.x
[borg01x142:29232] mca: base: components_register: registering oob components
[borg01x142:29232] mca: base: components_register: found loaded component tcp
[borg01x142:29232] mca: base: components_register: component tcp register function successful
[borg01x142:29232] mca: base: components_open: opening oob components
[borg01x142:29232] mca: base: components_open: found loaded component tcp
[borg01x142:29232] mca: base: components_open: component tcp open function successful
[borg01x142:29232] mca:oob:select: checking available component tcp
[borg01x142:29232] mca:oob:select: Querying component [tcp]
[borg01x142:29232] oob:tcp: component_available called
[borg01x142:29232] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[borg01x142:29232] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4
[borg01x142:29232] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4
[borg01x142:29232] [[52298,0],0] oob:tcp:init adding 10.1.25.142 to our list of V4 connections
[borg01x142:29232] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4
[borg01x142:29232] [[52298,0],0] oob:tcp:init adding 172.31.1.254 to our list of V4 connections
[borg01x142:29232] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4
[borg01x142:29232] [[52298,0],0] oob:tcp:init adding 10.12.25.142 to our list of V4 connections
[borg01x142:29232] [[52298,0],0] TCP STARTUP
[borg01x142:29232] [[52298,0],0] attempting to bind to IPv4 port 0
[borg01x142:29232] [[52298,0],0] assigned IPv4 port 41686
[borg01x142:29232] mca:oob:select: Adding component to end
[borg01x142:29232] mca:oob:select: Found 1 active transports
srun.slurm: cluster configuration lacks support for cpu binding
srun.slurm: cluster configuration lacks support for cpu binding
[borg01x153:01290] mca: base: components_register: registering oob components
[borg01x153:01290] mca: base: components_register: found loaded component tcp
[borg01x143:13793] mca: base: components_register: registering oob components
[borg01x143:13793] mca: base: components_register: found loaded component tcp
[borg01x153:01290] mca: base: components_register: component tcp register function successful
[borg01x153:01290] mca: base: components_open: opening oob components
[borg01x153:01290] mca: base: components_open: found loaded component tcp
[borg01x153:01290] mca: base: components_open: component tcp open function successful
[borg01x153:01290] mca:oob:select: checking available component tcp
[borg01x153:01290] mca:oob:select: Querying component [tcp]
[borg01x153:01290] oob:tcp: component_available called
[borg01x153:01290] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[borg01x153:01290] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4
[borg01x153:01290] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4
[borg01x153:01290] [[52298,0],4] oob:tcp:init adding 10.1.25.153 to our list of V4 connections
[borg01x153:01290] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4
[borg01x153:01290] [[52298,0],4] oob:tcp:init adding 172.31.1.254 to our list of V4 connections
[borg01x153:01290] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4
[borg01x153:01290] [[52298,0],4] oob:tcp:init adding 10.12.25.153 to our list of V4 connections
[borg01x153:01290] [[52298,0],4] TCP STARTUP
[borg01x153:01290] [[52298,0],4] attempting to bind to IPv4 port 0
[borg01x143:13793] mca: base: components_register: component tcp register function successful
[borg01x153:01290] [[52298,0],4] assigned IPv4 port 38028
[borg01x143:13793] mca: base: components_open: opening oob components
[borg01x143:13793] mca: base: components_open: found loaded component tcp
[borg01x143:13793] mca: base: components_open: component tcp open function successful
[borg01x143:13793] mca:oob:select: checking available component tcp
[borg01x143:13793] mca:oob:select: Querying component [tcp]
[borg01x143:13793] oob:tcp: component_available called
[borg01x143:13793] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[borg01x143:13793] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4
[borg01x143:13793] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4
[borg01x143:13793] [[52298,0],1] oob:tcp:init adding 10.1.25.143 to our list of V4 connections
[borg01x143:13793] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4
[borg01x143:13793] [[52298,0],1] oob:tcp:init adding 172.31.1.254 to our list of V4 connections
[borg01x143:13793] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4
[borg01x143:13793] [[52298,0],1] oob:tcp:init adding 10.12.25.143 to our list of V4 connections
[borg01x143:13793] [[52298,0],1] TCP STARTUP
[borg01x143:13793] [[52298,0],1] attempting to bind to IPv4 port 0
[borg01x153:01290] mca:oob:select: Adding component to end
[borg01x153:01290] mca:oob:select: Found 1 active transports
[borg01x143:13793] [[52298,0],1] assigned IPv4 port 44719
[borg01x143:13793] mca:oob:select: Adding component to end
[borg01x143:13793] mca:oob:select: Found 1 active transports
[borg01x144:30878] mca: base: components_register: registering oob components
[borg01x144:30878] mca: base: components_register: found loaded component tcp
[borg01x144:30878] mca: base: components_register: component tcp register function successful
[borg01x144:30878] mca: base: components_open: opening oob components
[borg01x144:30878] mca: base: components_open: found loaded component tcp
[borg01x144:30878] mca: base: components_open: component tcp open function successful
[borg01x144:30878] mca:oob:select: checking available component tcp
[borg01x144:30878] mca:oob:select: Querying component [tcp]
[borg01x144:30878] oob:tcp: component_available called
[borg01x144:30878] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[borg01x144:30878] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4
[borg01x144:30878] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4
[borg01x144:30878] [[52298,0],2] oob:tcp:init adding 10.1.25.144 to our list of V4 connections
[borg01x144:30878] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4
[borg01x144:30878] [[52298,0],2] oob:tcp:init adding 172.31.1.254 to our list of V4 connections
[borg01x144:30878] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4
[borg01x144:30878] [[52298,0],2] oob:tcp:init adding 10.12.25.144 to our list of V4 connections
[borg01x144:30878] [[52298,0],2] TCP STARTUP
[borg01x144:30878] [[52298,0],2] attempting to bind to IPv4 port 0
[borg01x144:30878] [[52298,0],2] assigned IPv4 port 40700
[borg01x144:30878] mca:oob:select: Adding component to end
[borg01x144:30878] mca:oob:select: Found 1 active transports
[borg01x154:01154] mca: base: components_register: registering oob components
[borg01x154:01154] mca: base: components_register: found loaded component tcp
[borg01x154:01154] mca: base: components_register: component tcp register function successful
[borg01x154:01154] mca: base: components_open: opening oob components
[borg01x154:01154] mca: base: components_open: found loaded component tcp
[borg01x154:01154] mca: base: components_open: component tcp open function successful
[borg01x154:01154] mca:oob:select: checking available component tcp
[borg01x154:01154] mca:oob:select: Querying component [tcp]
[borg01x154:01154] oob:tcp: component_available called
[borg01x154:01154] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[borg01x154:01154] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4
[borg01x154:01154] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4
[borg01x154:01154] [[52298,0],5] oob:tcp:init adding 10.1.25.154 to our list of V4 connections
[borg01x154:01154] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4
[borg01x154:01154] [[52298,0],5] oob:tcp:init adding 172.31.1.254 to our list of V4 connections
[borg01x154:01154] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4
[borg01x154:01154] [[52298,0],5] oob:tcp:init adding 10.12.25.154 to our list of V4 connections
[borg01x154:01154] [[52298,0],5] TCP STARTUP
[borg01x154:01154] [[52298,0],5] attempting to bind to IPv4 port 0
[borg01x154:01154] [[52298,0],5] assigned IPv4 port 41191
[borg01x154:01154] mca:oob:select: Adding component to end
[borg01x154:01154] mca:oob:select: Found 1 active transports
[borg01x145:02419] mca: base: components_register: registering oob components
[borg01x145:02419] mca: base: components_register: found loaded component tcp
[borg01x145:02419] mca: base: components_register: component tcp register function successful
[borg01x145:02419] mca: base: components_open: opening oob components
[borg01x145:02419] mca: base: components_open: found loaded component tcp
[borg01x145:02419] mca: base: components_open: component tcp open function successful
[borg01x145:02419] mca:oob:select: checking available component tcp
[borg01x145:02419] mca:oob:select: Querying component [tcp]
[borg01x145:02419] oob:tcp: component_available called
[borg01x145:02419] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[borg01x145:02419] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4
[borg01x145:02419] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4
[borg01x145:02419] [[52298,0],3] oob:tcp:init adding 10.1.25.145 to our list of V4 connections
[borg01x145:02419] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4
[borg01x145:02419] [[52298,0],3] oob:tcp:init adding 172.31.1.254 to our list of V4 connections
[borg01x145:02419] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4
[borg01x145:02419] [[52298,0],3] oob:tcp:init adding 10.12.25.145 to our list of V4 connections
[borg01x145:02419] [[52298,0],3] TCP STARTUP
[borg01x145:02419] [[52298,0],3] attempting to bind to IPv4 port 0
[borg01x145:02419] [[52298,0],3] assigned IPv4 port 51079
[borg01x145:02419] mca:oob:select: Adding component to end
[borg01x145:02419] mca:oob:select: Found 1 active transports
[borg01x144:30878] [[52298,0],2] ORTE_ERROR_LOG: Bad parameter in file base/rml_base_contact.c at line 161
[borg01x144:30878] [[52298,0],2] ORTE_ERROR_LOG: Bad parameter in file routed_binomial.c at line 498
[borg01x144:30878] [[52298,0],2] ORTE_ERROR_LOG: Bad parameter in file base/ess_base_std_orted.c at line 539
srun.slurm: error: borg01x143: task 0: Exited with exit code 213
srun.slurm: Terminating job step 2332583.24
slurmd[borg01x144]: *** STEP 2332583.24 KILLED AT 2014-08-29T13:59:30 WITH SIGNAL 9 ***
srun.slurm: Job step aborted: Waiting up to 2 seconds for job step to finish.
srun.slurm: error: borg01x153: task 3: Exited with exit code 213
[borg01x153:01290] [[52298,0],4] ORTE_ERROR_LOG: Bad parameter in file base/rml_base_contact.c at line 161
[borg01x153:01290] [[52298,0],4] ORTE_ERROR_LOG: Bad parameter in file routed_binomial.c at line 498
[borg01x153:01290] [[52298,0],4] ORTE_ERROR_LOG: Bad parameter in file base/ess_base_std_orted.c at line 539
[borg01x143:13793] [[52298,0],1] ORTE_ERROR_LOG: Bad parameter in file base/rml_base_contact.c at line 161
[borg01x143:13793] [[52298,0],1] ORTE_ERROR_LOG: Bad parameter in file routed_binomial.c at line 498
[borg01x143:13793] [[52298,0],1] ORTE_ERROR_LOG: Bad parameter in file base/ess_base_std_orted.c at line 539
slurmd[borg01x144]: *** STEP 2332583.24 KILLED AT 2014-08-29T13:59:30 WITH SIGNAL 9 ***
srun.slurm: error: borg01x144: task 1: Exited with exit code 213
[borg01x154:01154] [[52298,0],5] ORTE_ERROR_LOG: Bad parameter in file base/rml_base_contact.c at line 161
[borg01x154:01154] [[52298,0],5] ORTE_ERROR_LOG: Bad parameter in file routed_binomial.c at line 498
[borg01x154:01154] [[52298,0],5] ORTE_ERROR_LOG: Bad parameter in file base/ess_base_std_orted.c at line 539
slurmd[borg01x154]: *** STEP 2332583.24 KILLED AT 2014-08-29T13:59:30 WITH SIGNAL 9 ***
slurmd[borg01x154]: *** STEP 2332583.24 KILLED AT 2014-08-29T13:59:30 WITH SIGNAL 9 ***
srun.slurm: error: borg01x154: task 4: Exited with exit code 213
srun.slurm: error: borg01x145: task 2: Exited with exit code 213
[borg01x145:02419] [[52298,0],3] ORTE_ERROR_LOG: Bad parameter in file base/rml_base_contact.c at line 161
[borg01x145:02419] [[52298,0],3] ORTE_ERROR_LOG: Bad parameter in file routed_binomial.c at line 498
[borg01x145:02419] [[52298,0],3] ORTE_ERROR_LOG: Bad parameter in file base/ess_base_std_orted.c at line 539
slurmd[borg01x145]: *** STEP 2332583.24 KILLED AT 2014-08-29T13:59:30 WITH SIGNAL 9 ***
slurmd[borg01x145]: *** STEP 2332583.24 KILLED AT 2014-08-29T13:59:30 WITH SIGNAL 9 ***
sh: tcp://10.1.25.142,172.31.1.254,10.12.25.142:41686: No such file or directory
[borg01x142:29232] [[52298,0],0] TCP SHUTDOWN
[borg01x142:29232] mca: base: close: component tcp closed
[borg01x142:29232] mca: base: close: unloading component tcp

Note: if I can get the allocation today, I want to try doing all this on a single SandyBridge node rather than on six. It might make comparing the various runs a bit easier!

Matt



On Fri, Aug 29, 2014 at 12:42 PM, Ralph Castain <r...@open-mpi.org> wrote:
Okay, something quite weird is happening here. I can't replicate this using the 1.8.2 release tarball on a Slurm machine, so my guess is that something else is going on.

Could you please rebuild the 1.8.2 code with --enable-debug on the configure line (assuming you haven't already done so), and then rerun it as before, adding "--mca oob_base_verbose 10" to the command line?
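
Roughly, that would be something like the following, where the angle-bracket bits are placeholders for your own options and paths:

$ ./configure --enable-debug <your usual configure options> --prefix=<a separate debug prefix>
$ make && make install
$ <debug prefix>/bin/mpirun --leave-session-attached --debug-daemons --mca oob_base_verbose 10 -np 8 ./helloWorld.182.x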


On Aug 29, 2014, at 4:22 AM, Matt Thompson <fort...@gmail.com> wrote:

Ralph,

For 1.8.2rc4 I get:

(1003) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2rc4/bin/mpirun --leave-session-attached --debug-daemons -np 8 ./helloWorld.182.x
srun.slurm: cluster configuration lacks support for cpu binding
srun.slurm: cluster configuration lacks support for cpu binding
Daemon [[47143,0],5] checking in as pid 10990 on host borg01x154
[borg01x154:10990] [[47143,0],5] orted: up and running - waiting for commands!
Daemon [[47143,0],1] checking in as pid 23473 on host borg01x143
Daemon [[47143,0],2] checking in as pid 8250 on host borg01x144
[borg01x144:08250] [[47143,0],2] orted: up and running - waiting for commands!
[borg01x143:23473] [[47143,0],1] orted: up and running - waiting for commands!
Daemon [[47143,0],3] checking in as pid 12320 on host borg01x145
Daemon [[47143,0],4] checking in as pid 10902 on host borg01x153
[borg01x153:10902] [[47143,0],4] orted: up and running - waiting for commands!
[borg01x145:12320] [[47143,0],3] orted: up and running - waiting for commands!
[borg01x142:01629] [[47143,0],0] orted_cmd: received add_local_procs
[borg01x144:08250] [[47143,0],2] orted_cmd: received add_local_procs
[borg01x153:10902] [[47143,0],4] orted_cmd: received add_local_procs
[borg01x143:23473] [[47143,0],1] orted_cmd: received add_local_procs
[borg01x145:12320] [[47143,0],3] orted_cmd: received add_local_procs
[borg01x154:10990] [[47143,0],5] orted_cmd: received add_local_procs
[borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap from local proc [[47143,1],0]
[borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap from local proc [[47143,1],2]
[borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap from local proc [[47143,1],3]
[borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap from local proc [[47143,1],1]
[borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap from local proc [[47143,1],5]
[borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap from local proc [[47143,1],4]
[borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap from local proc [[47143,1],6]
[borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap from local proc [[47143,1],7]
  MPIR_being_debugged = 0
  MPIR_debug_state = 1
  MPIR_partial_attach_ok = 1
  MPIR_i_am_starter = 0
  MPIR_forward_output = 0
  MPIR_proctable_size = 8
  MPIR_proctable:
    (i, host, exe, pid) = (0, borg01x142, /home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1647)
    (i, host, exe, pid) = (1, borg01x142, /home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1648)
    (i, host, exe, pid) = (2, borg01x142, /home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1650)
    (i, host, exe, pid) = (3, borg01x142, /home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1652)
    (i, host, exe, pid) = (4, borg01x142, /home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1654)
    (i, host, exe, pid) = (5, borg01x142, /home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1656)
    (i, host, exe, pid) = (6, borg01x142, /home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1658)
    (i, host, exe, pid) = (7, borg01x142, /home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1660)
MPIR_executable_path: NULL
MPIR_server_arguments: NULL
[borg01x142:01629] [[47143,0],0] orted_cmd: received message_local_procs
[borg01x144:08250] [[47143,0],2] orted_cmd: received message_local_procs
[borg01x143:23473] [[47143,0],1] orted_cmd: received message_local_procs
[borg01x153:10902] [[47143,0],4] orted_cmd: received message_local_procs
[borg01x154:10990] [[47143,0],5] orted_cmd: received message_local_procs
[borg01x145:12320] [[47143,0],3] orted_cmd: received message_local_procs
[borg01x142:01629] [[47143,0],0] orted_cmd: received message_local_procs
[borg01x143:23473] [[47143,0],1] orted_cmd: received message_local_procs
[borg01x144:08250] [[47143,0],2] orted_cmd: received message_local_procs
[borg01x153:10902] [[47143,0],4] orted_cmd: received message_local_procs
[borg01x145:12320] [[47143,0],3] orted_cmd: received message_local_procs
Process    2 of    8 is on borg01x142
Process    5 of    8 is on borg01x142
Process    4 of    8 is on borg01x142
Process    1 of    8 is on borg01x142
Process    0 of    8 is on borg01x142
Process    3 of    8 is on borg01x142
Process    6 of    8 is on borg01x142
Process    7 of    8 is on borg01x142
[borg01x154:10990] [[47143,0],5] orted_cmd: received message_local_procs
[borg01x142:01629] [[47143,0],0] orted_cmd: received message_local_procs
[borg01x144:08250] [[47143,0],2] orted_cmd: received message_local_procs
[borg01x143:23473] [[47143,0],1] orted_cmd: received message_local_procs
[borg01x153:10902] [[47143,0],4] orted_cmd: received message_local_procs
[borg01x154:10990] [[47143,0],5] orted_cmd: received message_local_procs
[borg01x145:12320] [[47143,0],3] orted_cmd: received message_local_procs
[borg01x142:01629] [[47143,0],0] orted_recv: received sync from local proc [[47143,1],2]
[borg01x142:01629] [[47143,0],0] orted_recv: received sync from local proc [[47143,1],1]
[borg01x142:01629] [[47143,0],0] orted_recv: received sync from local proc [[47143,1],3]
[borg01x142:01629] [[47143,0],0] orted_recv: received sync from local proc [[47143,1],0]
[borg01x142:01629] [[47143,0],0] orted_recv: received sync from local proc [[47143,1],4]
[borg01x142:01629] [[47143,0],0] orted_recv: received sync from local proc [[47143,1],6]
[borg01x142:01629] [[47143,0],0] orted_recv: received sync from local proc [[47143,1],5]
[borg01x142:01629] [[47143,0],0] orted_recv: received sync from local proc [[47143,1],7]
[borg01x142:01629] [[47143,0],0] orted_cmd: received exit cmd
[borg01x144:08250] [[47143,0],2] orted_cmd: received exit cmd
[borg01x144:08250] [[47143,0],2] orted_cmd: all routes and children gone - exiting
[borg01x153:10902] [[47143,0],4] orted_cmd: received exit cmd
[borg01x153:10902] [[47143,0],4] orted_cmd: all routes and children gone - exiting
[borg01x143:23473] [[47143,0],1] orted_cmd: received exit cmd
[borg01x154:10990] [[47143,0],5] orted_cmd: received exit cmd
[borg01x154:10990] [[47143,0],5] orted_cmd: all routes and children gone - exiting
[borg01x145:12320] [[47143,0],3] orted_cmd: received exit cmd
[borg01x145:12320] [[47143,0],3] orted_cmd: all routes and children gone - exiting

Using the 1.8.2 mpirun:

(1004) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2/bin/mpirun --leave-session-attached --debug-daemons -np 8 ./helloWorld.182.x
srun.slurm: cluster configuration lacks support for cpu binding
srun.slurm: cluster configuration lacks support for cpu binding
[borg01x143:23494] [[47330,0],1] ORTE_ERROR_LOG: Bad parameter in file base/rml_base_contact.c at line 161
[borg01x143:23494] [[47330,0],1] ORTE_ERROR_LOG: Bad parameter in file routed_binomial.c at line 498
[borg01x143:23494] [[47330,0],1] ORTE_ERROR_LOG: Bad parameter in file base/ess_base_std_orted.c at line 539
srun.slurm: error: borg01x143: task 0: Exited with exit code 213
srun.slurm: Terminating job step 2332583.4
[borg01x153:10915] [[47330,0],4] ORTE_ERROR_LOG: Bad parameter in file base/rml_base_contact.c at line 161
[borg01x153:10915] [[47330,0],4] ORTE_ERROR_LOG: Bad parameter in file routed_binomial.c at line 498
[borg01x153:10915] [[47330,0],4] ORTE_ERROR_LOG: Bad parameter in file base/ess_base_std_orted.c at line 539
[borg01x144:08263] [[47330,0],2] ORTE_ERROR_LOG: Bad parameter in file base/rml_base_contact.c at line 161
[borg01x144:08263] [[47330,0],2] ORTE_ERROR_LOG: Bad parameter in file routed_binomial.c at line 498
[borg01x144:08263] [[47330,0],2] ORTE_ERROR_LOG: Bad parameter in file base/ess_base_std_orted.c at line 539
srun.slurm: Job step aborted: Waiting up to 2 seconds for job step to finish.
slurmd[borg01x145]: *** STEP 2332583.4 KILLED AT 2014-08-29T07:16:20 WITH SIGNAL 9 ***
slurmd[borg01x154]: *** STEP 2332583.4 KILLED AT 2014-08-29T07:16:20 WITH SIGNAL 9 ***
slurmd[borg01x153]: *** STEP 2332583.4 KILLED AT 2014-08-29T07:16:20 WITH SIGNAL 9 ***
slurmd[borg01x153]: *** STEP 2332583.4 KILLED AT 2014-08-29T07:16:20 WITH SIGNAL 9 ***
srun.slurm: error: borg01x144: task 1: Exited with exit code 213
slurmd[borg01x144]: *** STEP 2332583.4 KILLED AT 2014-08-29T07:16:20 WITH SIGNAL 9 ***
slurmd[borg01x144]: *** STEP 2332583.4 KILLED AT 2014-08-29T07:16:20 WITH SIGNAL 9 ***
srun.slurm: error: borg01x153: task 3: Exited with exit code 213
slurmd[borg01x154]: *** STEP 2332583.4 KILLED AT 2014-08-29T07:16:20 WITH SIGNAL 9 ***
slurmd[borg01x145]: *** STEP 2332583.4 KILLED AT 2014-08-29T07:16:20 WITH SIGNAL 9 ***
srun.slurm: error: borg01x154: task 4: Killed
srun.slurm: error: borg01x145: task 2: Killed
sh: tcp://10.1.25.142,172.31.1.254,10.12.25.142:34169: No such file or directory




On Thu, Aug 28, 2014 at 7:17 PM, Ralph Castain <r...@open-mpi.org> wrote:
I'm unaware of any changes to the Slurm integration between rc4 and the final release. It sounds like something else might be going on - try adding "--leave-session-attached --debug-daemons" to your 1.8.2 command line and let's see if any errors get reported.
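
That is, something along the lines of:

mpirun --leave-session-attached --debug-daemons -np 8 ./helloWorld.182.x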


On Aug 28, 2014, at 12:20 PM, Matt Thompson <fort...@gmail.com> wrote:

Open MPI List,

I recently encountered an odd bug with Open MPI 1.8.1 and GCC 4.9.1 on our cluster (reported on this list), and decided to try 1.8.2. However, we seem to be having an issue with Open MPI 1.8.2 and SLURM. Even weirder, Open MPI 1.8.2rc4 doesn't show the bug. The bug is: I get no stdout with Open MPI 1.8.2. That is, HelloWorld doesn't work.

To wit, our sysadmin has two tarballs:

(1441) $ sha1sum openmpi-1.8.2rc4.tar.bz2
7e7496913c949451f546f22a1a159df25f8bb683  openmpi-1.8.2rc4.tar.bz2
(1442) $ sha1sum openmpi-1.8.2.tar.gz
cf2b1e45575896f63367406c6c50574699d8b2e1  openmpi-1.8.2.tar.gz

I then build each one with a script, following the method our sysadmin usually uses:

#!/bin/sh 
set -x
export PREFIX=/discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/nlocal/slurm/2.6.3/lib64
build() {
  echo `pwd`
  ./configure --with-slurm --disable-wrapper-rpath --enable-shared --enable-mca-no-build=btl-usnic \
      CC=gcc CXX=g++ F77=gfortran FC=gfortran \
      CFLAGS="-mtune=generic -fPIC -m64" CXXFLAGS="-mtune=generic -fPIC -m64" FFLAGS="-mtune=generic -fPIC -m64" \
      F77FLAGS="-mtune=generic -fPIC -m64" FCFLAGS="-mtune=generic -fPIC -m64" F90FLAGS="-mtune=generic -fPIC -m64" \
      LDFLAGS="-L/usr/nlocal/slurm/2.6.3/lib64" CPPFLAGS="-I/usr/nlocal/slurm/2.6.3/include" LIBS="-lpciaccess" \
     --prefix=${PREFIX} 2>&1 | tee configure.1.8.2.log
  make 2>&1 | tee make.1.8.2.log
  make check 2>&1 | tee makecheck.1.8.2.log
  make install 2>&1 | tee makeinstall.1.8.2.log
}
echo "calling build"
build
echo "exiting"

The only difference between the two scripts is '1.8.2' versus '1.8.2rc4' in the PREFIX and in the log file tees. Now, let us test. First, I grab some nodes with SLURM:

$ salloc --nodes=6 --ntasks-per-node=16 --constraint=sand --time=09:00:00 --account=g0620 --mail-type=BEGIN

Once I get my nodes, I run with 1.8.2rc4:

(1142) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2rc4/bin/mpifort -o helloWorld.182rc4.x helloWorld.F90
(1143) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2rc4/bin/mpirun -np 8 ./helloWorld.182rc4.x
Process    0 of    8 is on borg01w044
Process    5 of    8 is on borg01w044
Process    3 of    8 is on borg01w044
Process    7 of    8 is on borg01w044
Process    1 of    8 is on borg01w044
Process    2 of    8 is on borg01w044
Process    4 of    8 is on borg01w044
Process    6 of    8 is on borg01w044

Now 1.8.2:

(1144) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2/bin/mpifort -o helloWorld.182.x helloWorld.F90
(1145) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2/bin/mpirun -np 8 ./helloWorld.182.x
(1146) $

No output at all. But if I take the helloWorld.182.x built with 1.8.2 and run it with 1.8.2rc4's mpirun:

(1146) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2rc4/bin/mpirun -np 8 ./helloWorld.182.x
Process    5 of    8 is on borg01w044
Process    7 of    8 is on borg01w044
Process    2 of    8 is on borg01w044
Process    4 of    8 is on borg01w044
Process    1 of    8 is on borg01w044
Process    3 of    8 is on borg01w044
Process    6 of    8 is on borg01w044
Process    0 of    8 is on borg01w044

So... any idea what is happening here? There did seem to be a few SLURM-related changes between the two tarballs involving /dev/null, but it's a bit above me to decipher them.
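
(For anyone who wants to look themselves: with the two tarballs unpacked side by side, a comparison along these lines shows the /dev/null hunks I mean; whether they matter is another question.)

diff -ru openmpi-1.8.2rc4 openmpi-1.8.2 | grep -B2 -A2 '/dev/null'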

You can find the ompi_info, build, make, config, etc. logs at these links (they are ~300 kB, which is over the mailing-list limit according to the Open MPI web page):


Thank you for any help and please let me know if you need more information,
Matt

--
"And, isn't sanity really just a one-trick pony anyway? I mean all you
 get is one trick: rational thinking. But when you're good and crazy, 
 oooh, oooh, oooh, the sky is the limit!" -- The Tick
