Jeff, I tried your script and I saw:

(1027) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2/bin/mpirun -np 8 ./script.sh
(1028) $

Now, the very first time I ran it, I think I might have noticed a blip of
orted on the nodes, but it disappeared fast. When I re-run the same command,
it just seems to exit immediately with nothing showing up.

If I use my "debug-patch" version, I see:

(1028) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug-patch/bin/mpirun -np 8 ./script.sh
hello world
hello world
hello world
hello world
hello world
hello world
hello world
hello world

And, well, it's there for 10 minutes, I'm guessing. If I ssh to another of
the nodes in my allocation:

(1005) $ ps aux | grep openmpi
mathomp4 20317  0.0  0.0  59952  4256 ?      S    09:17   0:00 /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug-patch/bin/orted -mca orte_ess_jobid 1842544640 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 6 -mca orte_hnp_uri 1842544640.0;tcp://10.1.24.169,172.31.1.254,10.12.24.169:41684
mathomp4 20389  0.0  0.0   5524   844 pts/0  S+   09:19   0:00 grep --color=auto openmpi

Matt
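[Editor's note: the back-end-node check Jeff asks for (quoted below) can be
scripted from inside the allocation. A minimal sketch, assuming passwordless
ssh to the compute nodes (as used above) and standard SLURM tooling
(scontrol and $SLURM_JOB_NODELIST):

#!/bin/sh
# Expand the allocation's node list, then look on each host for the orted
# daemon and the "sleep 600" children that script.sh should have spawned.
for node in $(scontrol show hostnames "$SLURM_JOB_NODELIST"); do
    echo "== $node =="
    ssh "$node" 'ps -eo pid,user,args | egrep "orted|sleep 600" | grep -v egrep'
done

If orted and the sleeps show up but no "hello world" is printed, the job is
running and only its stdout is being lost, which is exactly the distinction
Jeff wants to make.]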
On Tue, Sep 2, 2014 at 5:35 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:

> Matt --
>
> We were discussing this issue on our weekly OMPI engineering call today.
>
> Can you check one thing for me? With the un-edited 1.8.2 tarball
> installation, I see that you're getting no output for commands that you
> run -- but also no errors.
>
> Can you verify and see if your commands are actually *running*? E.g., try:
>
> $ cat > script.sh <<EOF
> #!/bin/sh
> echo hello world
> sleep 600
> echo goodbye world
> EOF
> $ chmod +x script.sh
> $ setenv OMPI_MCA_shmem_mmap_enable_nfs_warning 0
> $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-clean/bin/mpirun -np 8 script.sh
>
> and then go "ps" on the back-end nodes and see if there is an "orted"
> process and N "sleep 600" processes running on them.
>
> I'm *assuming* you won't see the "hello world" output.
>
> The purpose of this test is that I want to see if OMPI is just totally
> erring out and not even running your job (which is quite unlikely; OMPI
> should be much more noisy when this happens), or whether we're simply not
> seeing the stdout from the job.
>
> Thanks.
>
>
> On Sep 2, 2014, at 9:36 AM, Matt Thompson <fort...@gmail.com> wrote:
>
> > On that machine, it would be SLES 11 SP1. I think it's soon
> > transitioning to SLES 11 SP3.
> >
> > I also use Open MPI on an RHEL 6.5 box (possibly soon to be RHEL 7).
> >
> >
> > On Mon, Sep 1, 2014 at 8:41 PM, Ralph Castain <r...@open-mpi.org> wrote:
> > Thanks - I expect we'll have to release 1.8.3 soon to fix this in case
> > others have similar issues. Out of curiosity, what OS are you using?
> >
> >
> > On Sep 1, 2014, at 9:00 AM, Matt Thompson <fort...@gmail.com> wrote:
> >
> >> Ralph,
> >>
> >> Okay, that seems to have done it here (well, minus the usual
> >> shmem_mmap_enable_nfs_warning that our system always generates):
> >>
> >> (1033) $ setenv OMPI_MCA_shmem_mmap_enable_nfs_warning 0
> >> (1034) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug-patch/bin/mpirun -np 8 ./helloWorld.182-debug-patch.x
> >> Process 7 of 8 is on borg01w218
> >> Process 5 of 8 is on borg01w218
> >> Process 1 of 8 is on borg01w218
> >> Process 3 of 8 is on borg01w218
> >> Process 0 of 8 is on borg01w218
> >> Process 2 of 8 is on borg01w218
> >> Process 4 of 8 is on borg01w218
> >> Process 6 of 8 is on borg01w218
> >>
> >> I'll ask the admin to apply the patch locally...and wait for 1.8.3, I suppose.
> >>
> >> Thanks,
> >> Matt
> >>
> >> On Sun, Aug 31, 2014 at 10:08 AM, Ralph Castain <r...@open-mpi.org> wrote:
> >> Hmmm... I may see the problem. Would you be so kind as to apply the
> >> attached patch to your 1.8.2 code, rebuild, and try again?
> >>
> >> I much appreciate the help. Everyone's system is slightly different,
> >> and I think you've uncovered one of those differences.
> >> Ralph
> >>
> >>
> >> On Aug 31, 2014, at 6:25 AM, Matt Thompson <fort...@gmail.com> wrote:
> >>
> >>> Ralph,
> >>>
> >>> Sorry it took me a bit of time. Here you go:
> >>>
> >>> (1002) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug/bin/mpirun --leave-session-attached --debug-daemons --mca oob_base_verbose 10 -mca plm_base_verbose 5 -np 8 ./helloWorld.182-debug.x
> >>> [borg01w063:03815] mca:base:select:( plm) Querying component [isolated]
> >>> [borg01w063:03815] mca:base:select:( plm) Query of component [isolated] set priority to 0
> >>> [borg01w063:03815] mca:base:select:( plm) Querying component [rsh]
> >>> [borg01w063:03815] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
> >>> [borg01w063:03815] mca:base:select:( plm) Query of component [rsh] set priority to 10
> >>> [borg01w063:03815] mca:base:select:( plm) Querying component [slurm]
> >>> [borg01w063:03815] [[INVALID],INVALID] plm:slurm: available for selection
> >>> [borg01w063:03815] mca:base:select:( plm) Query of component [slurm] set priority to 75
> >>> [borg01w063:03815] mca:base:select:( plm) Selected component [slurm]
> >>> [borg01w063:03815] plm:base:set_hnp_name: initial bias 3815 nodename hash 1757783593
> >>> [borg01w063:03815] plm:base:set_hnp_name: final jobfam 49163
> >>> [borg01w063:03815] mca: base: components_register: registering oob components
> >>> [borg01w063:03815] mca: base: components_register: found loaded component tcp
> >>> [borg01w063:03815] mca: base: components_register: component tcp register function successful
> >>> [borg01w063:03815] mca: base: components_open: opening oob components
> >>> [borg01w063:03815] mca: base: components_open: found loaded component tcp
> >>> [borg01w063:03815] mca: base: components_open: component tcp open function successful
> >>> [borg01w063:03815] mca:oob:select: checking available component tcp
> >>> [borg01w063:03815] mca:oob:select: Querying component [tcp]
> >>> [borg01w063:03815] oob:tcp: component_available called
> >>> [borg01w063:03815] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
> >>> [borg01w063:03815] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4
> >>> [borg01w063:03815] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4
> >>> [borg01w063:03815] [[49163,0],0] oob:tcp:init adding 10.1.24.63 to our list of V4 connections
> >>> [borg01w063:03815] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4
> >>> [borg01w063:03815] [[49163,0],0] oob:tcp:init adding 172.31.1.254 to our list of V4 connections
> >>> [borg01w063:03815] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4
> >>> [borg01w063:03815] [[49163,0],0] oob:tcp:init adding 10.12.24.63 to our list of V4 connections
> >>> [borg01w063:03815] [[49163,0],0] TCP STARTUP
> >>> [borg01w063:03815] [[49163,0],0] attempting to bind to IPv4 port 0
> >>> [borg01w063:03815] [[49163,0],0] assigned IPv4 port 41373
> >>> [borg01w063:03815] mca:oob:select: Adding component to end
> >>> [borg01w063:03815] mca:oob:select: Found 1 active transports
> >>> [borg01w063:03815] [[49163,0],0] plm:base:receive start comm
> >>> [borg01w063:03815] [[49163,0],0] plm:base:setup_job
[borg01w063:03815] [[49163,0],0] plm:slurm: LAUNCH DAEMONS CALLED > >>> [borg01w063:03815] [[49163,0],0] plm:base:setup_vm > >>> [borg01w063:03815] [[49163,0],0] plm:base:setup_vm creating map > >>> [borg01w063:03815] [[49163,0],0] plm:base:setup_vm add new daemon > [[49163,0],1] > >>> [borg01w063:03815] [[49163,0],0] plm:base:setup_vm assigning new > daemon [[49163,0],1] to node borg01w064 > >>> [borg01w063:03815] [[49163,0],0] plm:base:setup_vm add new daemon > [[49163,0],2] > >>> [borg01w063:03815] [[49163,0],0] plm:base:setup_vm assigning new > daemon [[49163,0],2] to node borg01w065 > >>> [borg01w063:03815] [[49163,0],0] plm:base:setup_vm add new daemon > [[49163,0],3] > >>> [borg01w063:03815] [[49163,0],0] plm:base:setup_vm assigning new > daemon [[49163,0],3] to node borg01w069 > >>> [borg01w063:03815] [[49163,0],0] plm:base:setup_vm add new daemon > [[49163,0],4] > >>> [borg01w063:03815] [[49163,0],0] plm:base:setup_vm assigning new > daemon [[49163,0],4] to node borg01w070 > >>> [borg01w063:03815] [[49163,0],0] plm:base:setup_vm add new daemon > [[49163,0],5] > >>> [borg01w063:03815] [[49163,0],0] plm:base:setup_vm assigning new > daemon [[49163,0],5] to node borg01w071 > >>> [borg01w063:03815] [[49163,0],0] plm:slurm: launching on nodes > borg01w064,borg01w065,borg01w069,borg01w070,borg01w071 > >>> [borg01w063:03815] [[49163,0],0] plm:slurm: Set > prefix:/discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug > >>> [borg01w063:03815] [[49163,0],0] plm:slurm: final top-level argv: > >>> srun --ntasks-per-node=1 --kill-on-bad-exit --cpu_bind=none > --nodes=5 --nodelist=borg01w064,borg01w065,borg01w069,borg01w070,borg01w071 > --ntasks=5 orted -mca orte_debug_daemons 1 -mca orte_leave_session_attached > 1 -mca orte_ess_jobid 3221946368 -mca orte_ess_vpid 1 -mca > orte_ess_num_procs 6 -mca orte_hnp_uri 3221946368.0;tcp://10.1.24.63 > ,172.31.1.254,10.12.24.63:41373 --mca oob_base_verbose 10 -mca > plm_base_verbose 5 > >>> [borg01w063:03815] [[49163,0],0] plm:slurm: reset PATH: > /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug/bin:/usr/local/other/SLES11/gcc/4.9.1/bin:/usr/local/other/SLES11.1/git/ > 1.8.5.2/libexec/git-core:/usr/local/other/SLES11.1/git/1.8.5.2/bin:/usr/local/other/SLES11/svn/1.6.17/bin:/usr/local/other/SLES11/tkcvs/8.2.3/gcc-4.3.2/bin:.:/home/mathomp4/bin:/home/mathomp4/cvstools:/discover/nobackup/projects/gmao/share/dasilva/opengrads/Contents:/usr/local/other/Htop/1.0/bin:/usr/local/other/SLES11/gnuplot/4.6.0/gcc-4.3.2/bin:/usr/local/other/SLES11/xpdf/3.03-gcc-4.3.2/bin:/home/mathomp4/src/pdtoolkit-3.16/x86_64/bin:/discover/nobackup/mathomp4/WavewatchIII-GMAO/bin:/discover/nobackup/mathomp4/WavewatchIII-GMAO/exe:/usr/local/other/pods:/usr/local/other/SLES11.1/R/3.1.0/gcc-4.3.4/lib64/R/bin:.:/home/mathomp4/bin:/home/mathomp4/cvstools:/discover/nobackup/projects/gmao/share/dasilva/opengrads/Contents:/usr/local/other/Htop/1.0/bin:/usr/local/other/SLES11/gnuplot/4.6 > > > .0/gcc-4.3.2/bin:/usr/local/other/SLES11/xpdf/3.03-gcc-4.3.2/bin:/home/mathomp4/src/pdtoolkit-3.16/x86_64/bin:/discover/nobackup/mathomp4/WavewatchIII-GMAO/bin:/discover/nobackup/mathomp4/WavewatchIII-GMAO/exe:/usr/local/other/pods:/usr/local/other/SLES11.1/R/3.1.0/gcc-4.3.4/lib64/R/bin:/home/mathomp4/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/bin/X11:/usr/X11R6/bin:/usr/games:/opt/kde3/bin:/usr/lib/mit/bin:/usr/lib/mit/sbin:/usr/slurm/bin > >>> [borg01w063:03815] [[49163,0],0] plm:slurm: reset LD_LIBRARY_PATH: > 
/discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug/lib:/usr/local/other/SLES11/gcc/4.9.1/lib64:/usr/local/other/SLES11.1/git/ > 1.8.5.2/lib:/usr/local/other/SLES11/svn/1.6.17/lib:/usr/local/other/SLES11/tkcvs/8.2.3/gcc-4.3.2/lib > >>> srun.slurm: cluster configuration lacks support for cpu binding > >>> srun.slurm: cluster configuration lacks support for cpu binding > >>> [borg01w065:15893] mca: base: components_register: registering oob > components > >>> [borg01w065:15893] mca: base: components_register: found loaded > component tcp > >>> [borg01w065:15893] mca: base: components_register: component tcp > register function successful > >>> [borg01w065:15893] mca: base: components_open: opening oob components > >>> [borg01w065:15893] mca: base: components_open: found loaded component > tcp > >>> [borg01w065:15893] mca: base: components_open: component tcp open > function successful > >>> [borg01w065:15893] mca:oob:select: checking available component tcp > >>> [borg01w065:15893] mca:oob:select: Querying component [tcp] > >>> [borg01w065:15893] oob:tcp: component_available called > >>> [borg01w065:15893] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4 > >>> [borg01w065:15893] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4 > >>> [borg01w065:15893] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4 > >>> [borg01w065:15893] [[49163,0],2] oob:tcp:init adding 10.1.24.65 to our > list of V4 connections > >>> [borg01w065:15893] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4 > >>> [borg01w065:15893] [[49163,0],2] oob:tcp:init adding 172.31.1.254 to > our list of V4 connections > >>> [borg01w065:15893] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4 > >>> [borg01w065:15893] [[49163,0],2] oob:tcp:init adding 10.12.24.65 to > our list of V4 connections > >>> [borg01w065:15893] [[49163,0],2] TCP STARTUP > >>> [borg01w065:15893] [[49163,0],2] attempting to bind to IPv4 port 0 > >>> [borg01w065:15893] [[49163,0],2] assigned IPv4 port 43456 > >>> [borg01w065:15893] mca:oob:select: Adding component to end > >>> [borg01w065:15893] mca:oob:select: Found 1 active transports > >>> [borg01w070:12645] mca: base: components_register: registering oob > components > >>> [borg01w070:12645] mca: base: components_register: found loaded > component tcp > >>> [borg01w070:12645] mca: base: components_register: component tcp > register function successful > >>> [borg01w070:12645] mca: base: components_open: opening oob components > >>> [borg01w070:12645] mca: base: components_open: found loaded component > tcp > >>> [borg01w070:12645] mca: base: components_open: component tcp open > function successful > >>> [borg01w070:12645] mca:oob:select: checking available component tcp > >>> [borg01w070:12645] mca:oob:select: Querying component [tcp] > >>> [borg01w070:12645] oob:tcp: component_available called > >>> [borg01w070:12645] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4 > >>> [borg01w070:12645] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4 > >>> [borg01w070:12645] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4 > >>> [borg01w070:12645] [[49163,0],4] oob:tcp:init adding 10.1.24.70 to our > list of V4 connections > >>> [borg01w070:12645] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4 > >>> [borg01w070:12645] [[49163,0],4] oob:tcp:init adding 172.31.1.254 to > our list of V4 connections > >>> [borg01w070:12645] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4 > >>> [borg01w070:12645] [[49163,0],4] oob:tcp:init adding 10.12.24.70 to > our list of V4 connections > >>> [borg01w070:12645] [[49163,0],4] TCP STARTUP > >>> 
[borg01w070:12645] [[49163,0],4] attempting to bind to IPv4 port 0 > >>> [borg01w070:12645] [[49163,0],4] assigned IPv4 port 53062 > >>> [borg01w070:12645] mca:oob:select: Adding component to end > >>> [borg01w070:12645] mca:oob:select: Found 1 active transports > >>> [borg01w064:16565] mca: base: components_register: registering oob > components > >>> [borg01w064:16565] mca: base: components_register: found loaded > component tcp > >>> [borg01w064:16565] mca: base: components_register: component tcp > register function successful > >>> [borg01w071:14879] mca: base: components_register: registering oob > components > >>> [borg01w071:14879] mca: base: components_register: found loaded > component tcp > >>> [borg01w064:16565] mca: base: components_open: opening oob components > >>> [borg01w064:16565] mca: base: components_open: found loaded component > tcp > >>> [borg01w064:16565] mca: base: components_open: component tcp open > function successful > >>> [borg01w064:16565] mca:oob:select: checking available component tcp > >>> [borg01w064:16565] mca:oob:select: Querying component [tcp] > >>> [borg01w064:16565] oob:tcp: component_available called > >>> [borg01w064:16565] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4 > >>> [borg01w064:16565] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4 > >>> [borg01w064:16565] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4 > >>> [borg01w064:16565] [[49163,0],1] oob:tcp:init adding 10.1.24.64 to our > list of V4 connections > >>> [borg01w064:16565] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4 > >>> [borg01w064:16565] [[49163,0],1] oob:tcp:init adding 172.31.1.254 to > our list of V4 connections > >>> [borg01w064:16565] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4 > >>> [borg01w064:16565] [[49163,0],1] oob:tcp:init adding 10.12.24.64 to > our list of V4 connections > >>> [borg01w064:16565] [[49163,0],1] TCP STARTUP > >>> [borg01w064:16565] [[49163,0],1] attempting to bind to IPv4 port 0 > >>> [borg01w064:16565] [[49163,0],1] assigned IPv4 port 43828 > >>> [borg01w064:16565] mca:oob:select: Adding component to end > >>> [borg01w069:30276] mca: base: components_register: registering oob > components > >>> [borg01w069:30276] mca: base: components_register: found loaded > component tcp > >>> [borg01w071:14879] mca: base: components_register: component tcp > register function successful > >>> [borg01w069:30276] mca: base: components_register: component tcp > register function successful > >>> [borg01w071:14879] mca: base: components_open: opening oob components > >>> [borg01w071:14879] mca: base: components_open: found loaded component > tcp > >>> [borg01w071:14879] mca: base: components_open: component tcp open > function successful > >>> [borg01w071:14879] mca:oob:select: checking available component tcp > >>> [borg01w071:14879] mca:oob:select: Querying component [tcp] > >>> [borg01w071:14879] oob:tcp: component_available called > >>> [borg01w069:30276] mca: base: components_open: opening oob components > >>> [borg01w069:30276] mca: base: components_open: found loaded component > tcp > >>> [borg01w069:30276] mca: base: components_open: component tcp open > function successful > >>> [borg01w071:14879] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4 > >>> [borg01w071:14879] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4 > >>> [borg01w071:14879] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4 > >>> [borg01w071:14879] [[49163,0],5] oob:tcp:init adding 10.1.24.71 to our > list of V4 connections > >>> [borg01w071:14879] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4 > >>> 
[borg01w069:30276] mca:oob:select: checking available component tcp > >>> [borg01w069:30276] mca:oob:select: Querying component [tcp] > >>> [borg01w069:30276] oob:tcp: component_available called > >>> [borg01w071:14879] [[49163,0],5] oob:tcp:init adding 172.31.1.254 to > our list of V4 connections > >>> [borg01w071:14879] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4 > >>> [borg01w071:14879] [[49163,0],5] oob:tcp:init adding 10.12.24.71 to > our list of V4 connections > >>> [borg01w071:14879] [[49163,0],5] TCP STARTUP > >>> [borg01w069:30276] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4 > >>> [borg01w069:30276] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4 > >>> [borg01w069:30276] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4 > >>> [borg01w069:30276] [[49163,0],3] oob:tcp:init adding 10.1.24.69 to our > list of V4 connections > >>> [borg01w069:30276] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4 > >>> [borg01w069:30276] [[49163,0],3] oob:tcp:init adding 172.31.1.254 to > our list of V4 connections > >>> [borg01w069:30276] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4 > >>> [borg01w069:30276] [[49163,0],3] oob:tcp:init adding 10.12.24.69 to > our list of V4 connections > >>> [borg01w069:30276] [[49163,0],3] TCP STARTUP > >>> [borg01w071:14879] [[49163,0],5] attempting to bind to IPv4 port 0 > >>> [borg01w069:30276] [[49163,0],3] attempting to bind to IPv4 port 0 > >>> [borg01w069:30276] [[49163,0],3] assigned IPv4 port 39299 > >>> [borg01w064:16565] mca:oob:select: Found 1 active transports > >>> [borg01w069:30276] mca:oob:select: Adding component to end > >>> [borg01w069:30276] mca:oob:select: Found 1 active transports > >>> [borg01w071:14879] [[49163,0],5] assigned IPv4 port 56113 > >>> [borg01w071:14879] mca:oob:select: Adding component to end > >>> [borg01w071:14879] mca:oob:select: Found 1 active transports > >>> srun.slurm: error: borg01w064: task 0: Exited with exit code 213 > >>> srun.slurm: Terminating job step 2347743.3 > >>> srun.slurm: Job step aborted: Waiting up to 2 seconds for job step to > finish. 
> >>> [borg01w070:12645] [[49163,0],4] ORTE_ERROR_LOG: Bad parameter in file base/rml_base_contact.c at line 161
> >>> [borg01w070:12645] [[49163,0],4] ORTE_ERROR_LOG: Bad parameter in file routed_binomial.c at line 498
> >>> [borg01w070:12645] [[49163,0],4] ORTE_ERROR_LOG: Bad parameter in file base/ess_base_std_orted.c at line 539
> >>> [borg01w065:15893] [[49163,0],2] ORTE_ERROR_LOG: Bad parameter in file base/rml_base_contact.c at line 161
> >>> [borg01w065:15893] [[49163,0],2] ORTE_ERROR_LOG: Bad parameter in file routed_binomial.c at line 498
> >>> [borg01w065:15893] [[49163,0],2] ORTE_ERROR_LOG: Bad parameter in file base/ess_base_std_orted.c at line 539
> >>> slurmd[borg01w065]: *** STEP 2347743.3 KILLED AT 2014-08-31T09:24:17 WITH SIGNAL 9 ***
> >>> slurmd[borg01w070]: *** STEP 2347743.3 KILLED AT 2014-08-31T09:24:17 WITH SIGNAL 9 ***
> >>> [borg01w064:16565] [[49163,0],1] ORTE_ERROR_LOG: Bad parameter in file base/rml_base_contact.c at line 161
> >>> [borg01w064:16565] [[49163,0],1] ORTE_ERROR_LOG: Bad parameter in file routed_binomial.c at line 498
> >>> [borg01w064:16565] [[49163,0],1] ORTE_ERROR_LOG: Bad parameter in file base/ess_base_std_orted.c at line 539
> >>> [borg01w069:30276] [[49163,0],3] ORTE_ERROR_LOG: Bad parameter in file base/rml_base_contact.c at line 161
> >>> [borg01w069:30276] [[49163,0],3] ORTE_ERROR_LOG: Bad parameter in file routed_binomial.c at line 498
> >>> [borg01w069:30276] [[49163,0],3] ORTE_ERROR_LOG: Bad parameter in file base/ess_base_std_orted.c at line 539
> >>> slurmd[borg01w069]: *** STEP 2347743.3 KILLED AT 2014-08-31T09:24:17 WITH SIGNAL 9 ***
> >>> [borg01w071:14879] [[49163,0],5] ORTE_ERROR_LOG: Bad parameter in file base/rml_base_contact.c at line 161
> >>> [borg01w071:14879] [[49163,0],5] ORTE_ERROR_LOG: Bad parameter in file routed_binomial.c at line 498
> >>> [borg01w071:14879] [[49163,0],5] ORTE_ERROR_LOG: Bad parameter in file base/ess_base_std_orted.c at line 539
> >>> slurmd[borg01w071]: *** STEP 2347743.3 KILLED AT 2014-08-31T09:24:17 WITH SIGNAL 9 ***
> >>> slurmd[borg01w065]: *** STEP 2347743.3 KILLED AT 2014-08-31T09:24:17 WITH SIGNAL 9 ***
> >>> slurmd[borg01w069]: *** STEP 2347743.3 KILLED AT 2014-08-31T09:24:17 WITH SIGNAL 9 ***
> >>> slurmd[borg01w070]: *** STEP 2347743.3 KILLED AT 2014-08-31T09:24:17 WITH SIGNAL 9 ***
> >>> slurmd[borg01w071]: *** STEP 2347743.3 KILLED AT 2014-08-31T09:24:17 WITH SIGNAL 9 ***
> >>> srun.slurm: error: borg01w069: task 2: Exited with exit code 213
> >>> srun.slurm: error: borg01w065: task 1: Exited with exit code 213
> >>> srun.slurm: error: borg01w071: task 4: Exited with exit code 213
> >>> srun.slurm: error: borg01w070: task 3: Exited with exit code 213
> >>> sh: tcp://10.1.24.63,172.31.1.254,10.12.24.63:41373: No such file or directory
> >>> [borg01w063:03815] [[49163,0],0] plm:slurm: primary daemons complete!
> >>> [borg01w063:03815] [[49163,0],0] plm:base:receive stop comm
> >>> [borg01w063:03815] [[49163,0],0] TCP SHUTDOWN
> >>> [borg01w063:03815] mca: base: close: component tcp closed
> >>> [borg01w063:03815] mca: base: close: unloading component tcp
> >>>
> >>>
> >>> On Fri, Aug 29, 2014 at 3:18 PM, Ralph Castain <r...@open-mpi.org> wrote:
> >>> Rats - I also need "-mca plm_base_verbose 5" on there so I can see the
> >>> cmd line being executed. Can you add it?
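[Editor's note: one line in the log above deserves a closer look:

sh: tcp://10.1.24.63,172.31.1.254,10.12.24.63:41373: No such file or directory

That reads like a shell splitting the launch command at the unquoted ";"
inside the orte_hnp_uri value and then trying to execute the tcp:// remainder
as a program. A minimal sketch of that failure mode (sh -c stands in for
whatever shell is interpreting the launch line; the URI is the one from the
log above):

#!/bin/sh
# Unquoted, the ';' terminates the first command, and the shell then tries
# to run "tcp://..." as a program, producing exactly the error seen above.
sh -c 'echo orted -mca orte_hnp_uri 3221946368.0;tcp://10.1.24.63,172.31.1.254,10.12.24.63:41373'

# Quoted, the URI survives as a single argument to orted:
sh -c 'echo orted -mca orte_hnp_uri "3221946368.0;tcp://10.1.24.63,172.31.1.254,10.12.24.63:41373"'

This is an interpretation, not something the thread confirms, but it would
also explain why the orteds abort in rml_base_contact.c with "Bad parameter":
they never receive a parseable HNP contact URI.]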
> >>> > >>> > >>> On Aug 29, 2014, at 11:16 AM, Matt Thompson <fort...@gmail.com> wrote: > >>> > >>>> Ralph, > >>>> > >>>> Here you go: > >>>> > >>>> (1080) $ > /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug/bin/mpirun > --leave-session-attached --debug-daemons --mca oob_base_verbose 10 -np 8 > ./helloWorld.182-debug.x > >>>> [borg01x142:29232] mca: base: components_register: registering oob > components > >>>> [borg01x142:29232] mca: base: components_register: found loaded > component tcp > >>>> [borg01x142:29232] mca: base: components_register: component tcp > register function successful > >>>> [borg01x142:29232] mca: base: components_open: opening oob components > >>>> [borg01x142:29232] mca: base: components_open: found loaded component > tcp > >>>> [borg01x142:29232] mca: base: components_open: component tcp open > function successful > >>>> [borg01x142:29232] mca:oob:select: checking available component tcp > >>>> [borg01x142:29232] mca:oob:select: Querying component [tcp] > >>>> [borg01x142:29232] oob:tcp: component_available called > >>>> [borg01x142:29232] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4 > >>>> [borg01x142:29232] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4 > >>>> [borg01x142:29232] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4 > >>>> [borg01x142:29232] [[52298,0],0] oob:tcp:init adding 10.1.25.142 to > our list of V4 connections > >>>> [borg01x142:29232] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4 > >>>> [borg01x142:29232] [[52298,0],0] oob:tcp:init adding 172.31.1.254 to > our list of V4 connections > >>>> [borg01x142:29232] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4 > >>>> [borg01x142:29232] [[52298,0],0] oob:tcp:init adding 10.12.25.142 to > our list of V4 connections > >>>> [borg01x142:29232] [[52298,0],0] TCP STARTUP > >>>> [borg01x142:29232] [[52298,0],0] attempting to bind to IPv4 port 0 > >>>> [borg01x142:29232] [[52298,0],0] assigned IPv4 port 41686 > >>>> [borg01x142:29232] mca:oob:select: Adding component to end > >>>> [borg01x142:29232] mca:oob:select: Found 1 active transports > >>>> srun.slurm: cluster configuration lacks support for cpu binding > >>>> srun.slurm: cluster configuration lacks support for cpu binding > >>>> [borg01x153:01290] mca: base: components_register: registering oob > components > >>>> [borg01x153:01290] mca: base: components_register: found loaded > component tcp > >>>> [borg01x143:13793] mca: base: components_register: registering oob > components > >>>> [borg01x143:13793] mca: base: components_register: found loaded > component tcp > >>>> [borg01x153:01290] mca: base: components_register: component tcp > register function successful > >>>> [borg01x153:01290] mca: base: components_open: opening oob components > >>>> [borg01x153:01290] mca: base: components_open: found loaded component > tcp > >>>> [borg01x153:01290] mca: base: components_open: component tcp open > function successful > >>>> [borg01x153:01290] mca:oob:select: checking available component tcp > >>>> [borg01x153:01290] mca:oob:select: Querying component [tcp] > >>>> [borg01x153:01290] oob:tcp: component_available called > >>>> [borg01x153:01290] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4 > >>>> [borg01x153:01290] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4 > >>>> [borg01x153:01290] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4 > >>>> [borg01x153:01290] [[52298,0],4] oob:tcp:init adding 10.1.25.153 to > our list of V4 connections > >>>> [borg01x153:01290] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4 > >>>> [borg01x153:01290] [[52298,0],4] 
oob:tcp:init adding 172.31.1.254 to > our list of V4 connections > >>>> [borg01x153:01290] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4 > >>>> [borg01x153:01290] [[52298,0],4] oob:tcp:init adding 10.12.25.153 to > our list of V4 connections > >>>> [borg01x153:01290] [[52298,0],4] TCP STARTUP > >>>> [borg01x153:01290] [[52298,0],4] attempting to bind to IPv4 port 0 > >>>> [borg01x143:13793] mca: base: components_register: component tcp > register function successful > >>>> [borg01x153:01290] [[52298,0],4] assigned IPv4 port 38028 > >>>> [borg01x143:13793] mca: base: components_open: opening oob components > >>>> [borg01x143:13793] mca: base: components_open: found loaded component > tcp > >>>> [borg01x143:13793] mca: base: components_open: component tcp open > function successful > >>>> [borg01x143:13793] mca:oob:select: checking available component tcp > >>>> [borg01x143:13793] mca:oob:select: Querying component [tcp] > >>>> [borg01x143:13793] oob:tcp: component_available called > >>>> [borg01x143:13793] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4 > >>>> [borg01x143:13793] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4 > >>>> [borg01x143:13793] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4 > >>>> [borg01x143:13793] [[52298,0],1] oob:tcp:init adding 10.1.25.143 to > our list of V4 connections > >>>> [borg01x143:13793] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4 > >>>> [borg01x143:13793] [[52298,0],1] oob:tcp:init adding 172.31.1.254 to > our list of V4 connections > >>>> [borg01x143:13793] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4 > >>>> [borg01x143:13793] [[52298,0],1] oob:tcp:init adding 10.12.25.143 to > our list of V4 connections > >>>> [borg01x143:13793] [[52298,0],1] TCP STARTUP > >>>> [borg01x143:13793] [[52298,0],1] attempting to bind to IPv4 port 0 > >>>> [borg01x153:01290] mca:oob:select: Adding component to end > >>>> [borg01x153:01290] mca:oob:select: Found 1 active transports > >>>> [borg01x143:13793] [[52298,0],1] assigned IPv4 port 44719 > >>>> [borg01x143:13793] mca:oob:select: Adding component to end > >>>> [borg01x143:13793] mca:oob:select: Found 1 active transports > >>>> [borg01x144:30878] mca: base: components_register: registering oob > components > >>>> [borg01x144:30878] mca: base: components_register: found loaded > component tcp > >>>> [borg01x144:30878] mca: base: components_register: component tcp > register function successful > >>>> [borg01x144:30878] mca: base: components_open: opening oob components > >>>> [borg01x144:30878] mca: base: components_open: found loaded component > tcp > >>>> [borg01x144:30878] mca: base: components_open: component tcp open > function successful > >>>> [borg01x144:30878] mca:oob:select: checking available component tcp > >>>> [borg01x144:30878] mca:oob:select: Querying component [tcp] > >>>> [borg01x144:30878] oob:tcp: component_available called > >>>> [borg01x144:30878] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4 > >>>> [borg01x144:30878] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4 > >>>> [borg01x144:30878] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4 > >>>> [borg01x144:30878] [[52298,0],2] oob:tcp:init adding 10.1.25.144 to > our list of V4 connections > >>>> [borg01x144:30878] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4 > >>>> [borg01x144:30878] [[52298,0],2] oob:tcp:init adding 172.31.1.254 to > our list of V4 connections > >>>> [borg01x144:30878] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4 > >>>> [borg01x144:30878] [[52298,0],2] oob:tcp:init adding 10.12.25.144 to > our list of V4 connections > >>>> 
[borg01x144:30878] [[52298,0],2] TCP STARTUP > >>>> [borg01x144:30878] [[52298,0],2] attempting to bind to IPv4 port 0 > >>>> [borg01x144:30878] [[52298,0],2] assigned IPv4 port 40700 > >>>> [borg01x144:30878] mca:oob:select: Adding component to end > >>>> [borg01x144:30878] mca:oob:select: Found 1 active transports > >>>> [borg01x154:01154] mca: base: components_register: registering oob > components > >>>> [borg01x154:01154] mca: base: components_register: found loaded > component tcp > >>>> [borg01x154:01154] mca: base: components_register: component tcp > register function successful > >>>> [borg01x154:01154] mca: base: components_open: opening oob components > >>>> [borg01x154:01154] mca: base: components_open: found loaded component > tcp > >>>> [borg01x154:01154] mca: base: components_open: component tcp open > function successful > >>>> [borg01x154:01154] mca:oob:select: checking available component tcp > >>>> [borg01x154:01154] mca:oob:select: Querying component [tcp] > >>>> [borg01x154:01154] oob:tcp: component_available called > >>>> [borg01x154:01154] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4 > >>>> [borg01x154:01154] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4 > >>>> [borg01x154:01154] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4 > >>>> [borg01x154:01154] [[52298,0],5] oob:tcp:init adding 10.1.25.154 to > our list of V4 connections > >>>> [borg01x154:01154] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4 > >>>> [borg01x154:01154] [[52298,0],5] oob:tcp:init adding 172.31.1.254 to > our list of V4 connections > >>>> [borg01x154:01154] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4 > >>>> [borg01x154:01154] [[52298,0],5] oob:tcp:init adding 10.12.25.154 to > our list of V4 connections > >>>> [borg01x154:01154] [[52298,0],5] TCP STARTUP > >>>> [borg01x154:01154] [[52298,0],5] attempting to bind to IPv4 port 0 > >>>> [borg01x154:01154] [[52298,0],5] assigned IPv4 port 41191 > >>>> [borg01x154:01154] mca:oob:select: Adding component to end > >>>> [borg01x154:01154] mca:oob:select: Found 1 active transports > >>>> [borg01x145:02419] mca: base: components_register: registering oob > components > >>>> [borg01x145:02419] mca: base: components_register: found loaded > component tcp > >>>> [borg01x145:02419] mca: base: components_register: component tcp > register function successful > >>>> [borg01x145:02419] mca: base: components_open: opening oob components > >>>> [borg01x145:02419] mca: base: components_open: found loaded component > tcp > >>>> [borg01x145:02419] mca: base: components_open: component tcp open > function successful > >>>> [borg01x145:02419] mca:oob:select: checking available component tcp > >>>> [borg01x145:02419] mca:oob:select: Querying component [tcp] > >>>> [borg01x145:02419] oob:tcp: component_available called > >>>> [borg01x145:02419] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4 > >>>> [borg01x145:02419] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4 > >>>> [borg01x145:02419] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4 > >>>> [borg01x145:02419] [[52298,0],3] oob:tcp:init adding 10.1.25.145 to > our list of V4 connections > >>>> [borg01x145:02419] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4 > >>>> [borg01x145:02419] [[52298,0],3] oob:tcp:init adding 172.31.1.254 to > our list of V4 connections > >>>> [borg01x145:02419] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4 > >>>> [borg01x145:02419] [[52298,0],3] oob:tcp:init adding 10.12.25.145 to > our list of V4 connections > >>>> [borg01x145:02419] [[52298,0],3] TCP STARTUP > >>>> [borg01x145:02419] [[52298,0],3] 
attempting to bind to IPv4 port 0 > >>>> [borg01x145:02419] [[52298,0],3] assigned IPv4 port 51079 > >>>> [borg01x145:02419] mca:oob:select: Adding component to end > >>>> [borg01x145:02419] mca:oob:select: Found 1 active transports > >>>> [borg01x144:30878] [[52298,0],2] ORTE_ERROR_LOG: Bad parameter in > file base/rml_base_contact.c at line 161 > >>>> [borg01x144:30878] [[52298,0],2] ORTE_ERROR_LOG: Bad parameter in > file routed_binomial.c at line 498 > >>>> [borg01x144:30878] [[52298,0],2] ORTE_ERROR_LOG: Bad parameter in > file base/ess_base_std_orted.c at line 539 > >>>> srun.slurm: error: borg01x143: task 0: Exited with exit code 213 > >>>> srun.slurm: Terminating job step 2332583.24 > >>>> slurmd[borg01x144]: *** STEP 2332583.24 KILLED AT 2014-08-29T13:59:30 > WITH SIGNAL 9 *** > >>>> srun.slurm: Job step aborted: Waiting up to 2 seconds for job step to > finish. > >>>> srun.slurm: error: borg01x153: task 3: Exited with exit code 213 > >>>> [borg01x153:01290] [[52298,0],4] ORTE_ERROR_LOG: Bad parameter in > file base/rml_base_contact.c at line 161 > >>>> [borg01x153:01290] [[52298,0],4] ORTE_ERROR_LOG: Bad parameter in > file routed_binomial.c at line 498 > >>>> [borg01x153:01290] [[52298,0],4] ORTE_ERROR_LOG: Bad parameter in > file base/ess_base_std_orted.c at line 539 > >>>> [borg01x143:13793] [[52298,0],1] ORTE_ERROR_LOG: Bad parameter in > file base/rml_base_contact.c at line 161 > >>>> [borg01x143:13793] [[52298,0],1] ORTE_ERROR_LOG: Bad parameter in > file routed_binomial.c at line 498 > >>>> [borg01x143:13793] [[52298,0],1] ORTE_ERROR_LOG: Bad parameter in > file base/ess_base_std_orted.c at line 539 > >>>> slurmd[borg01x144]: *** STEP 2332583.24 KILLED AT 2014-08-29T13:59:30 > WITH SIGNAL 9 *** > >>>> srun.slurm: error: borg01x144: task 1: Exited with exit code 213 > >>>> [borg01x154:01154] [[52298,0],5] ORTE_ERROR_LOG: Bad parameter in > file base/rml_base_contact.c at line 161 > >>>> [borg01x154:01154] [[52298,0],5] ORTE_ERROR_LOG: Bad parameter in > file routed_binomial.c at line 498 > >>>> [borg01x154:01154] [[52298,0],5] ORTE_ERROR_LOG: Bad parameter in > file base/ess_base_std_orted.c at line 539 > >>>> slurmd[borg01x154]: *** STEP 2332583.24 KILLED AT 2014-08-29T13:59:30 > WITH SIGNAL 9 *** > >>>> slurmd[borg01x154]: *** STEP 2332583.24 KILLED AT 2014-08-29T13:59:30 > WITH SIGNAL 9 *** > >>>> srun.slurm: error: borg01x154: task 4: Exited with exit code 213 > >>>> srun.slurm: error: borg01x145: task 2: Exited with exit code 213 > >>>> [borg01x145:02419] [[52298,0],3] ORTE_ERROR_LOG: Bad parameter in > file base/rml_base_contact.c at line 161 > >>>> [borg01x145:02419] [[52298,0],3] ORTE_ERROR_LOG: Bad parameter in > file routed_binomial.c at line 498 > >>>> [borg01x145:02419] [[52298,0],3] ORTE_ERROR_LOG: Bad parameter in > file base/ess_base_std_orted.c at line 539 > >>>> slurmd[borg01x145]: *** STEP 2332583.24 KILLED AT 2014-08-29T13:59:30 > WITH SIGNAL 9 *** > >>>> slurmd[borg01x145]: *** STEP 2332583.24 KILLED AT 2014-08-29T13:59:30 > WITH SIGNAL 9 *** > >>>> sh: tcp://10.1.25.142,172.31.1.254,10.12.25.142:41686: No such file > or directory > >>>> [borg01x142:29232] [[52298,0],0] TCP SHUTDOWN > >>>> [borg01x142:29232] mca: base: close: component tcp closed > >>>> [borg01x142:29232] mca: base: close: unloading component tcp > >>>> > >>>> Note, if I can get the allocation today, I want to try doing all this > on a single SandyBridge node, rather than on 6. It might make comparing > various runs a bit easier! 
> >>>> Matt
> >>>>
> >>>>
> >>>> On Fri, Aug 29, 2014 at 12:42 PM, Ralph Castain <r...@open-mpi.org> wrote:
> >>>> Okay, something quite weird is happening here. I can't replicate it
> >>>> using the 1.8.2 release tarball on a slurm machine, so my guess is
> >>>> that something else is going on.
> >>>>
> >>>> Could you please rebuild the 1.8.2 code with --enable-debug on the
> >>>> configure line (assuming you haven't already done so), and then rerun
> >>>> that version as before but adding "--mca oob_base_verbose 10" to the
> >>>> cmd line?
> >>>>
> >>>>
> >>>> On Aug 29, 2014, at 4:22 AM, Matt Thompson <fort...@gmail.com> wrote:
> >>>>
> >>>>> Ralph,
> >>>>>
> >>>>> For 1.8.2rc4 I get:
> >>>>>
> >>>>> (1003) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2rc4/bin/mpirun --leave-session-attached --debug-daemons -np 8 ./helloWorld.182.x
> >>>>> srun.slurm: cluster configuration lacks support for cpu binding
> >>>>> srun.slurm: cluster configuration lacks support for cpu binding
> >>>>> Daemon [[47143,0],5] checking in as pid 10990 on host borg01x154
> >>>>> [borg01x154:10990] [[47143,0],5] orted: up and running - waiting for commands!
> >>>>> Daemon [[47143,0],1] checking in as pid 23473 on host borg01x143
> >>>>> Daemon [[47143,0],2] checking in as pid 8250 on host borg01x144
> >>>>> [borg01x144:08250] [[47143,0],2] orted: up and running - waiting for commands!
> >>>>> [borg01x143:23473] [[47143,0],1] orted: up and running - waiting for commands!
> >>>>> Daemon [[47143,0],3] checking in as pid 12320 on host borg01x145
> >>>>> Daemon [[47143,0],4] checking in as pid 10902 on host borg01x153
> >>>>> [borg01x153:10902] [[47143,0],4] orted: up and running - waiting for commands!
> >>>>> [borg01x145:12320] [[47143,0],3] orted: up and running - waiting for commands!
> >>>>> [borg01x142:01629] [[47143,0],0] orted_cmd: received add_local_procs > >>>>> [borg01x144:08250] [[47143,0],2] orted_cmd: received add_local_procs > >>>>> [borg01x153:10902] [[47143,0],4] orted_cmd: received add_local_procs > >>>>> [borg01x143:23473] [[47143,0],1] orted_cmd: received add_local_procs > >>>>> [borg01x145:12320] [[47143,0],3] orted_cmd: received add_local_procs > >>>>> [borg01x154:10990] [[47143,0],5] orted_cmd: received add_local_procs > >>>>> [borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap > from local proc [[47143,1],0] > >>>>> [borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap > from local proc [[47143,1],2] > >>>>> [borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap > from local proc [[47143,1],3] > >>>>> [borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap > from local proc [[47143,1],1] > >>>>> [borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap > from local proc [[47143,1],5] > >>>>> [borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap > from local proc [[47143,1],4] > >>>>> [borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap > from local proc [[47143,1],6] > >>>>> [borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap > from local proc [[47143,1],7] > >>>>> MPIR_being_debugged = 0 > >>>>> MPIR_debug_state = 1 > >>>>> MPIR_partial_attach_ok = 1 > >>>>> MPIR_i_am_starter = 0 > >>>>> MPIR_forward_output = 0 > >>>>> MPIR_proctable_size = 8 > >>>>> MPIR_proctable: > >>>>> (i, host, exe, pid) = (0, borg01x142, > /home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1647) > >>>>> (i, host, exe, pid) = (1, borg01x142, > /home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1648) > >>>>> (i, host, exe, pid) = (2, borg01x142, > /home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1650) > >>>>> (i, host, exe, pid) = (3, borg01x142, > /home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1652) > >>>>> (i, host, exe, pid) = (4, borg01x142, > /home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1654) > >>>>> (i, host, exe, pid) = (5, borg01x142, > /home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1656) > >>>>> (i, host, exe, pid) = (6, borg01x142, > /home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1658) > >>>>> (i, host, exe, pid) = (7, borg01x142, > /home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1660) > >>>>> MPIR_executable_path: NULL > >>>>> MPIR_server_arguments: NULL > >>>>> [borg01x142:01629] [[47143,0],0] orted_cmd: received > message_local_procs > >>>>> [borg01x144:08250] [[47143,0],2] orted_cmd: received > message_local_procs > >>>>> [borg01x143:23473] [[47143,0],1] orted_cmd: received > message_local_procs > >>>>> [borg01x153:10902] [[47143,0],4] orted_cmd: received > message_local_procs > >>>>> [borg01x154:10990] [[47143,0],5] orted_cmd: received > message_local_procs > >>>>> [borg01x145:12320] [[47143,0],3] orted_cmd: received > message_local_procs > >>>>> [borg01x142:01629] [[47143,0],0] orted_cmd: received > message_local_procs > >>>>> [borg01x143:23473] [[47143,0],1] orted_cmd: received > message_local_procs > >>>>> [borg01x144:08250] [[47143,0],2] orted_cmd: received > message_local_procs > >>>>> [borg01x153:10902] [[47143,0],4] orted_cmd: received > message_local_procs > >>>>> [borg01x145:12320] [[47143,0],3] orted_cmd: received > message_local_procs > >>>>> Process 2 of 8 is on borg01x142 > >>>>> Process 5 of 8 is on borg01x142 > >>>>> Process 4 of 8 is on borg01x142 > >>>>> Process 1 of 8 is on borg01x142 > >>>>> Process 0 of 8 is on borg01x142 > >>>>> 
Process 3 of 8 is on borg01x142 > >>>>> Process 6 of 8 is on borg01x142 > >>>>> Process 7 of 8 is on borg01x142 > >>>>> [borg01x154:10990] [[47143,0],5] orted_cmd: received > message_local_procs > >>>>> [borg01x142:01629] [[47143,0],0] orted_cmd: received > message_local_procs > >>>>> [borg01x144:08250] [[47143,0],2] orted_cmd: received > message_local_procs > >>>>> [borg01x143:23473] [[47143,0],1] orted_cmd: received > message_local_procs > >>>>> [borg01x153:10902] [[47143,0],4] orted_cmd: received > message_local_procs > >>>>> [borg01x154:10990] [[47143,0],5] orted_cmd: received > message_local_procs > >>>>> [borg01x145:12320] [[47143,0],3] orted_cmd: received > message_local_procs > >>>>> [borg01x142:01629] [[47143,0],0] orted_recv: received sync from > local proc [[47143,1],2] > >>>>> [borg01x142:01629] [[47143,0],0] orted_recv: received sync from > local proc [[47143,1],1] > >>>>> [borg01x142:01629] [[47143,0],0] orted_recv: received sync from > local proc [[47143,1],3] > >>>>> [borg01x142:01629] [[47143,0],0] orted_recv: received sync from > local proc [[47143,1],0] > >>>>> [borg01x142:01629] [[47143,0],0] orted_recv: received sync from > local proc [[47143,1],4] > >>>>> [borg01x142:01629] [[47143,0],0] orted_recv: received sync from > local proc [[47143,1],6] > >>>>> [borg01x142:01629] [[47143,0],0] orted_recv: received sync from > local proc [[47143,1],5] > >>>>> [borg01x142:01629] [[47143,0],0] orted_recv: received sync from > local proc [[47143,1],7] > >>>>> [borg01x142:01629] [[47143,0],0] orted_cmd: received exit cmd > >>>>> [borg01x144:08250] [[47143,0],2] orted_cmd: received exit cmd > >>>>> [borg01x144:08250] [[47143,0],2] orted_cmd: all routes and children > gone - exiting > >>>>> [borg01x153:10902] [[47143,0],4] orted_cmd: received exit cmd > >>>>> [borg01x153:10902] [[47143,0],4] orted_cmd: all routes and children > gone - exiting > >>>>> [borg01x143:23473] [[47143,0],1] orted_cmd: received exit cmd > >>>>> [borg01x154:10990] [[47143,0],5] orted_cmd: received exit cmd > >>>>> [borg01x154:10990] [[47143,0],5] orted_cmd: all routes and children > gone - exiting > >>>>> [borg01x145:12320] [[47143,0],3] orted_cmd: received exit cmd > >>>>> [borg01x145:12320] [[47143,0],3] orted_cmd: all routes and children > gone - exiting > >>>>> > >>>>> Using the 1.8.2 mpirun: > >>>>> > >>>>> (1004) $ > /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2/bin/mpirun > --leave-session-attached --debug-daemons -np 8 ./helloWorld.182.x > >>>>> srun.slurm: cluster configuration lacks support for cpu binding > >>>>> srun.slurm: cluster configuration lacks support for cpu binding > >>>>> [borg01x143:23494] [[47330,0],1] ORTE_ERROR_LOG: Bad parameter in > file base/rml_base_contact.c at line 161 > >>>>> [borg01x143:23494] [[47330,0],1] ORTE_ERROR_LOG: Bad parameter in > file routed_binomial.c at line 498 > >>>>> [borg01x143:23494] [[47330,0],1] ORTE_ERROR_LOG: Bad parameter in > file base/ess_base_std_orted.c at line 539 > >>>>> srun.slurm: error: borg01x143: task 0: Exited with exit code 213 > >>>>> srun.slurm: Terminating job step 2332583.4 > >>>>> [borg01x153:10915] [[47330,0],4] ORTE_ERROR_LOG: Bad parameter in > file base/rml_base_contact.c at line 161 > >>>>> [borg01x153:10915] [[47330,0],4] ORTE_ERROR_LOG: Bad parameter in > file routed_binomial.c at line 498 > >>>>> [borg01x153:10915] [[47330,0],4] ORTE_ERROR_LOG: Bad parameter in > file base/ess_base_std_orted.c at line 539 > >>>>> [borg01x144:08263] [[47330,0],2] ORTE_ERROR_LOG: Bad parameter in > file base/rml_base_contact.c at line 
161
> >>>>> [borg01x144:08263] [[47330,0],2] ORTE_ERROR_LOG: Bad parameter in file routed_binomial.c at line 498
> >>>>> [borg01x144:08263] [[47330,0],2] ORTE_ERROR_LOG: Bad parameter in file base/ess_base_std_orted.c at line 539
> >>>>> srun.slurm: Job step aborted: Waiting up to 2 seconds for job step to finish.
> >>>>> slurmd[borg01x145]: *** STEP 2332583.4 KILLED AT 2014-08-29T07:16:20 WITH SIGNAL 9 ***
> >>>>> slurmd[borg01x154]: *** STEP 2332583.4 KILLED AT 2014-08-29T07:16:20 WITH SIGNAL 9 ***
> >>>>> slurmd[borg01x153]: *** STEP 2332583.4 KILLED AT 2014-08-29T07:16:20 WITH SIGNAL 9 ***
> >>>>> slurmd[borg01x153]: *** STEP 2332583.4 KILLED AT 2014-08-29T07:16:20 WITH SIGNAL 9 ***
> >>>>> srun.slurm: error: borg01x144: task 1: Exited with exit code 213
> >>>>> slurmd[borg01x144]: *** STEP 2332583.4 KILLED AT 2014-08-29T07:16:20 WITH SIGNAL 9 ***
> >>>>> slurmd[borg01x144]: *** STEP 2332583.4 KILLED AT 2014-08-29T07:16:20 WITH SIGNAL 9 ***
> >>>>> srun.slurm: error: borg01x153: task 3: Exited with exit code 213
> >>>>> slurmd[borg01x154]: *** STEP 2332583.4 KILLED AT 2014-08-29T07:16:20 WITH SIGNAL 9 ***
> >>>>> slurmd[borg01x145]: *** STEP 2332583.4 KILLED AT 2014-08-29T07:16:20 WITH SIGNAL 9 ***
> >>>>> srun.slurm: error: borg01x154: task 4: Killed
> >>>>> srun.slurm: error: borg01x145: task 2: Killed
> >>>>> sh: tcp://10.1.25.142,172.31.1.254,10.12.25.142:34169: No such file or directory
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Thu, Aug 28, 2014 at 7:17 PM, Ralph Castain <r...@open-mpi.org> wrote:
> >>>>> I'm unaware of any changes to the Slurm integration between rc4 and
> >>>>> final release. It sounds like this might be something else going on -
> >>>>> try adding "--leave-session-attached --debug-daemons" to your 1.8.2
> >>>>> command line and let's see if any errors get reported.
> >>>>>
> >>>>>
> >>>>> On Aug 28, 2014, at 12:20 PM, Matt Thompson <fort...@gmail.com> wrote:
> >>>>>
> >>>>>> Open MPI List,
> >>>>>>
> >>>>>> I recently encountered an odd bug with Open MPI 1.8.1 and GCC 4.9.1
> >>>>>> on our cluster (reported on this list), and decided to try it with
> >>>>>> 1.8.2. However, we seem to be having an issue with Open MPI 1.8.2 and
> >>>>>> SLURM. Even weirder, Open MPI 1.8.2rc4 doesn't show the bug. And the
> >>>>>> bug is: I get no stdout with Open MPI 1.8.2. That is, HelloWorld
> >>>>>> doesn't work.
> >>>>>>
> >>>>>> To wit, our sysadmin has two tarballs:
> >>>>>>
> >>>>>> (1441) $ sha1sum openmpi-1.8.2rc4.tar.bz2
> >>>>>> 7e7496913c949451f546f22a1a159df25f8bb683  openmpi-1.8.2rc4.tar.bz2
> >>>>>> (1442) $ sha1sum openmpi-1.8.2.tar.gz
> >>>>>> cf2b1e45575896f63367406c6c50574699d8b2e1  openmpi-1.8.2.tar.gz
> >>>>>>
> >>>>>> I then build each with a script, following the method our sysadmin
> >>>>>> usually uses:
> >>>>>>
> >>>>>> #!/bin/sh
> >>>>>> set -x
> >>>>>> export PREFIX=/discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2
> >>>>>> export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/nlocal/slurm/2.6.3/lib64
> >>>>>>
> >>>>>> build() {
> >>>>>>    echo `pwd`
> >>>>>>    ./configure --with-slurm --disable-wrapper-rpath --enable-shared --enable-mca-no-build=btl-usnic \
> >>>>>>        CC=gcc CXX=g++ F77=gfortran FC=gfortran \
> >>>>>>        CFLAGS="-mtune=generic -fPIC -m64" CXXFLAGS="-mtune=generic -fPIC -m64" FFLAGS="-mtune=generic -fPIC -m64" \
> >>>>>>        F77FLAGS="-mtune=generic -fPIC -m64" FCFLAGS="-mtune=generic -fPIC -m64" F90FLAGS="-mtune=generic -fPIC -m64" \
> >>>>>>        LDFLAGS="-L/usr/nlocal/slurm/2.6.3/lib64" CPPFLAGS="-I/usr/nlocal/slurm/2.6.3/include" LIBS="-lpciaccess" \
> >>>>>>        --prefix=${PREFIX} 2>&1 | tee configure.1.8.2.log
> >>>>>>    make 2>&1 | tee make.1.8.2.log
> >>>>>>    make check 2>&1 | tee makecheck.1.8.2.log
> >>>>>>    make install 2>&1 | tee makeinstall.1.8.2.log
> >>>>>> }
> >>>>>>
> >>>>>> echo "calling build"
> >>>>>> build
> >>>>>> echo "exiting"
> >>>>>>
> >>>>>> The only difference between the two is '1.8.2' or '1.8.2rc4' in the
> >>>>>> PREFIX and the log-file tees. Now, let us test. First, I grab some
> >>>>>> nodes with slurm:
> >>>>>>
> >>>>>> $ salloc --nodes=6 --ntasks-per-node=16 --constraint=sand --time=09:00:00 --account=g0620 --mail-type=BEGIN
> >>>>>>
> >>>>>> Once I get my nodes, I run with 1.8.2rc4:
> >>>>>>
> >>>>>> (1142) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2rc4/bin/mpifort -o helloWorld.182rc4.x helloWorld.F90
> >>>>>> (1143) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2rc4/bin/mpirun -np 8 ./helloWorld.182rc4.x
> >>>>>> Process 0 of 8 is on borg01w044
> >>>>>> Process 5 of 8 is on borg01w044
> >>>>>> Process 3 of 8 is on borg01w044
> >>>>>> Process 7 of 8 is on borg01w044
> >>>>>> Process 1 of 8 is on borg01w044
> >>>>>> Process 2 of 8 is on borg01w044
> >>>>>> Process 4 of 8 is on borg01w044
> >>>>>> Process 6 of 8 is on borg01w044
> >>>>>>
> >>>>>> Now 1.8.2:
> >>>>>>
> >>>>>> (1144) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2/bin/mpifort -o helloWorld.182.x helloWorld.F90
> >>>>>> (1145) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2/bin/mpirun -np 8 ./helloWorld.182.x
> >>>>>> (1146) $
> >>>>>>
> >>>>>> No output at all. But, if I take the helloWorld.x from 1.8.2 and run
> >>>>>> it with 1.8.2rc4's mpirun:
> >>>>>>
> >>>>>> (1146) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2rc4/bin/mpirun -np 8 ./helloWorld.182.x
> >>>>>> Process 5 of 8 is on borg01w044
> >>>>>> Process 7 of 8 is on borg01w044
> >>>>>> Process 2 of 8 is on borg01w044
> >>>>>> Process 4 of 8 is on borg01w044
> >>>>>> Process 1 of 8 is on borg01w044
> >>>>>> Process 3 of 8 is on borg01w044
> >>>>>> Process 6 of 8 is on borg01w044
> >>>>>> Process 0 of 8 is on borg01w044
> >>>>>>
> >>>>>> So... any idea what is happening here? There did seem to be a few
> >>>>>> SLURM-related changes between the two tarballs involving /dev/null,
> >>>>>> but it's a bit above me to decipher.
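[Editor's note: one way to chase that "what changed between the tarballs"
question without reading source diffs is to compare what each install reports
about its own configuration. A minimal sketch (ompi_info and its --all flag
are standard Open MPI; the paths follow the thread's naming):

#!/bin/sh
# Dump each install's full self-description, then diff them; configure-time
# differences (flags, component lists, versions) show up side by side.
/discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2rc4/bin/ompi_info --all > rc4.txt
/discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2/bin/ompi_info --all > 182.txt
diff rc4.txt 182.txt

The ompi_info logs Matt links below should allow the same comparison offline.]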
> >>>>>>
> >>>>>> You can find the ompi_info, build, make, config, etc. logs at these
> >>>>>> links (they are ~300 kB, which is over the mailing list limit
> >>>>>> according to the Open MPI web page):
> >>>>>>
> >>>>>> https://dl.dropboxusercontent.com/u/61696/OMPI-1.8.2rc4-Output.tar.bz2
> >>>>>> https://dl.dropboxusercontent.com/u/61696/OMPI-1.8.2-Output.tar.bz2
> >>>>>>
> >>>>>> Thank you for any help and please let me know if you need more
> >>>>>> information,
> >>>>>> Matt
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/

--
"And, isn't sanity really just a one-trick pony anyway? I mean all you
get is one trick: rational thinking. But when you're good and crazy,
oooh, oooh, oooh, the sky is the limit!" -- The Tick