Jeff,

I tried your script and I saw:

(1027) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2/bin/mpirun
-np 8 ./script.sh
(1028) $

Now, the very first time I ran it, I think I noticed a brief blip of orted
on the nodes, but it disappeared quickly. When I re-run the same command,
it just exits immediately with no output at all.

If I use my "debug-patch" version, I see:

(1028) $
/discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug-patch//bin/mpirun
-np 8 ./script.sh
hello world
hello world
hello world
hello world
hello world
hello world
hello world
hello world

And it just sits there, presumably for the full 10-minute sleep. If I ssh
to another of the nodes in my allocation:

(1005) $ ps aux | grep openmpi
mathomp4 20317  0.0  0.0  59952  4256 ?        S    09:17   0:00
/discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug-patch/bin/orted
-mca orte_ess_jobid 1842544640 -mca orte_ess_vpid 1 -mca orte_ess_num_procs
6 -mca orte_hnp_uri 1842544640.0;tcp://10.1.24.169,172.31.1.254,
10.12.24.169:41684
mathomp4 20389  0.0  0.0   5524   844 pts/0    S+   09:19   0:00 grep
--color=auto openmpi
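
(For what it's worth, a quicker way to do that check on every node of the
allocation at once, rather than ssh-ing around by hand, would be something
like the sketch below. I haven't actually scripted it this way; it assumes
SLURM_JOB_NODELIST is set in the salloc shell and that ssh to the compute
nodes works, as it does above.)

$ for host in $(scontrol show hostnames "$SLURM_JOB_NODELIST"); do
      echo "== $host =="
      ssh "$host" "ps -eo pid,user,args | egrep 'orted|script.sh|sleep' | grep -v egrep"
  done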


Matt


On Tue, Sep 2, 2014 at 5:35 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com>
wrote:

> Matt --
>
> We were discussing this issue on our weekly OMPI engineering call today.
>
> Can you check one thing for me?  With the un-edited 1.8.2 tarball
> installation, I see that you're getting no output for commands that you run
> -- but also no errors.
>
> Can you verify that your commands are actually *running*?  E.g., try:
>
> $ cat > script.sh <<EOF
> #!/bin/sh
> echo hello world
> sleep 600
> echo goodbye world
> EOF
> $ chmod +x script.sh
> $ setenv OMPI_MCA_shmem_mmap_enable_nfs_warning 0
> $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-clean/bin/mpirun
> -np 8 script.sh
>
> and then go "ps" on the back-end nodes and see if there is an "orted"
> process and N "sleep 600" processes running on them.
>
> I'm *assuming* you won't see the "hello world" output.
>
> The purpose of this test is to see whether OMPI is just totally erroring
> out and not even running your job (which is quite unlikely; OMPI should be
> much more noisy when that happens), or whether we're simply not seeing the
> stdout from the job.
>
> Thanks.
>
>
>
> On Sep 2, 2014, at 9:36 AM, Matt Thompson <fort...@gmail.com> wrote:
>
> > On that machine, it would be SLES 11 SP1. I think it's soon
> transitioning to SLES 11 SP3.
> >
> > I also use Open MPI on an RHEL 6.5 box (possibly soon to be RHEL 7).
> >
> >
> > On Mon, Sep 1, 2014 at 8:41 PM, Ralph Castain <r...@open-mpi.org> wrote:
> > Thanks - I expect we'll have to release 1.8.3 soon to fix this in case
> others have similar issues. Out of curiosity, what OS are you using?
> >
> >
> > On Sep 1, 2014, at 9:00 AM, Matt Thompson <fort...@gmail.com> wrote:
> >
> >> Ralph,
> >>
> >> Okay that seems to have done it here (well, minus the usual
> shmem_mmap_enable_nfs_warning that our system always generates):
> >>
> >> (1033) $ setenv OMPI_MCA_shmem_mmap_enable_nfs_warning 0
> >> (1034) $
> /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug-patch/bin/mpirun
> -np 8 ./helloWorld.182-debug-patch.x
> >> Process    7 of    8 is on borg01w218
> >> Process    5 of    8 is on borg01w218
> >> Process    1 of    8 is on borg01w218
> >> Process    3 of    8 is on borg01w218
> >> Process    0 of    8 is on borg01w218
> >> Process    2 of    8 is on borg01w218
> >> Process    4 of    8 is on borg01w218
> >> Process    6 of    8 is on borg01w218
> >>
> >> I'll ask the admin to apply the patch locally...and wait for 1.8.3, I
> suppose.
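> >>
> >> (In case it helps whoever does the rebuild: assuming the patch is applied
> >> in the already-configured 1.8.2 source tree, the steps should just be the
> >> usual ones, roughly the sketch below. "ralph.patch" is only a stand-in
> >> name for the attachment, and -p1 may need to be -p0 depending on how the
> >> patch was generated.)
> >>
> >> $ cd openmpi-1.8.2
> >> $ patch -p1 < ralph.patch
> >> $ make && make install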
> >>
> >> Thanks,
> >> Matt
> >>
> >> On Sun, Aug 31, 2014 at 10:08 AM, Ralph Castain <r...@open-mpi.org>
> wrote:
> >> Hmmm....I may see the problem. Would you be so kind as to apply the
> attached patch to your 1.8.2 code, rebuild, and try again?
> >>
> >> Much appreciate the help. Everyone's system is slightly different, and
> I think you've uncovered one of those differences.
> >> Ralph
> >>
> >>
> >>
> >> On Aug 31, 2014, at 6:25 AM, Matt Thompson <fort...@gmail.com> wrote:
> >>
> >>> Ralph,
> >>>
> >>> Sorry it took me a bit of time. Here you go:
> >>>
> >>> (1002) $
> /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug/bin/mpirun
> --leave-session-attached --debug-daemons --mca oob_base_verbose 10 -mca
> plm_base_verbose 5 -np 8 ./helloWorld.182-debug.x
> >>> [borg01w063:03815] mca:base:select:(  plm) Querying component
> [isolated]
> >>> [borg01w063:03815] mca:base:select:(  plm) Query of component
> [isolated] set priority to 0
> >>> [borg01w063:03815] mca:base:select:(  plm) Querying component [rsh]
> >>> [borg01w063:03815] [[INVALID],INVALID] plm:rsh_lookup on agent ssh :
> rsh path NULL
> >>> [borg01w063:03815] mca:base:select:(  plm) Query of component [rsh]
> set priority to 10
> >>> [borg01w063:03815] mca:base:select:(  plm) Querying component [slurm]
> >>> [borg01w063:03815] [[INVALID],INVALID] plm:slurm: available for
> selection
> >>> [borg01w063:03815] mca:base:select:(  plm) Query of component [slurm]
> set priority to 75
> >>> [borg01w063:03815] mca:base:select:(  plm) Selected component [slurm]
> >>> [borg01w063:03815] plm:base:set_hnp_name: initial bias 3815 nodename
> hash 1757783593
> >>> [borg01w063:03815] plm:base:set_hnp_name: final jobfam 49163
> >>> [borg01w063:03815] mca: base: components_register: registering oob
> components
> >>> [borg01w063:03815] mca: base: components_register: found loaded
> component tcp
> >>> [borg01w063:03815] mca: base: components_register: component tcp
> register function successful
> >>> [borg01w063:03815] mca: base: components_open: opening oob components
> >>> [borg01w063:03815] mca: base: components_open: found loaded component
> tcp
> >>> [borg01w063:03815] mca: base: components_open: component tcp open
> function successful
> >>> [borg01w063:03815] mca:oob:select: checking available component tcp
> >>> [borg01w063:03815] mca:oob:select: Querying component [tcp]
> >>> [borg01w063:03815] oob:tcp: component_available called
> >>> [borg01w063:03815] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
> >>> [borg01w063:03815] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4
> >>> [borg01w063:03815] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4
> >>> [borg01w063:03815] [[49163,0],0] oob:tcp:init adding 10.1.24.63 to our
> list of V4 connections
> >>> [borg01w063:03815] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4
> >>> [borg01w063:03815] [[49163,0],0] oob:tcp:init adding 172.31.1.254 to
> our list of V4 connections
> >>> [borg01w063:03815] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4
> >>> [borg01w063:03815] [[49163,0],0] oob:tcp:init adding 10.12.24.63 to
> our list of V4 connections
> >>> [borg01w063:03815] [[49163,0],0] TCP STARTUP
> >>> [borg01w063:03815] [[49163,0],0] attempting to bind to IPv4 port 0
> >>> [borg01w063:03815] [[49163,0],0] assigned IPv4 port 41373
> >>> [borg01w063:03815] mca:oob:select: Adding component to end
> >>> [borg01w063:03815] mca:oob:select: Found 1 active transports
> >>> [borg01w063:03815] [[49163,0],0] plm:base:receive start comm
> >>> [borg01w063:03815] [[49163,0],0] plm:base:setup_job
> >>> [borg01w063:03815] [[49163,0],0] plm:slurm: LAUNCH DAEMONS CALLED
> >>> [borg01w063:03815] [[49163,0],0] plm:base:setup_vm
> >>> [borg01w063:03815] [[49163,0],0] plm:base:setup_vm creating map
> >>> [borg01w063:03815] [[49163,0],0] plm:base:setup_vm add new daemon
> [[49163,0],1]
> >>> [borg01w063:03815] [[49163,0],0] plm:base:setup_vm assigning new
> daemon [[49163,0],1] to node borg01w064
> >>> [borg01w063:03815] [[49163,0],0] plm:base:setup_vm add new daemon
> [[49163,0],2]
> >>> [borg01w063:03815] [[49163,0],0] plm:base:setup_vm assigning new
> daemon [[49163,0],2] to node borg01w065
> >>> [borg01w063:03815] [[49163,0],0] plm:base:setup_vm add new daemon
> [[49163,0],3]
> >>> [borg01w063:03815] [[49163,0],0] plm:base:setup_vm assigning new
> daemon [[49163,0],3] to node borg01w069
> >>> [borg01w063:03815] [[49163,0],0] plm:base:setup_vm add new daemon
> [[49163,0],4]
> >>> [borg01w063:03815] [[49163,0],0] plm:base:setup_vm assigning new
> daemon [[49163,0],4] to node borg01w070
> >>> [borg01w063:03815] [[49163,0],0] plm:base:setup_vm add new daemon
> [[49163,0],5]
> >>> [borg01w063:03815] [[49163,0],0] plm:base:setup_vm assigning new
> daemon [[49163,0],5] to node borg01w071
> >>> [borg01w063:03815] [[49163,0],0] plm:slurm: launching on nodes
> borg01w064,borg01w065,borg01w069,borg01w070,borg01w071
> >>> [borg01w063:03815] [[49163,0],0] plm:slurm: Set
> prefix:/discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug
> >>> [borg01w063:03815] [[49163,0],0] plm:slurm: final top-level argv:
> >>>     srun --ntasks-per-node=1 --kill-on-bad-exit --cpu_bind=none
> --nodes=5 --nodelist=borg01w064,borg01w065,borg01w069,borg01w070,borg01w071
> --ntasks=5 orted -mca orte_debug_daemons 1 -mca orte_leave_session_attached
> 1 -mca orte_ess_jobid 3221946368 -mca orte_ess_vpid 1 -mca
> orte_ess_num_procs 6 -mca orte_hnp_uri 3221946368.0;tcp://10.1.24.63
> ,172.31.1.254,10.12.24.63:41373 --mca oob_base_verbose 10 -mca
> plm_base_verbose 5
> >>> [borg01w063:03815] [[49163,0],0] plm:slurm: reset PATH:
> /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug/bin:/usr/local/other/SLES11/gcc/4.9.1/bin:/usr/local/other/SLES11.1/git/
> 1.8.5.2/libexec/git-core:/usr/local/other/SLES11.1/git/1.8.5.2/bin:/usr/local/other/SLES11/svn/1.6.17/bin:/usr/local/other/SLES11/tkcvs/8.2.3/gcc-4.3.2/bin:.:/home/mathomp4/bin:/home/mathomp4/cvstools:/discover/nobackup/projects/gmao/share/dasilva/opengrads/Contents:/usr/local/other/Htop/1.0/bin:/usr/local/other/SLES11/gnuplot/4.6.0/gcc-4.3.2/bin:/usr/local/other/SLES11/xpdf/3.03-gcc-4.3.2/bin:/home/mathomp4/src/pdtoolkit-3.16/x86_64/bin:/discover/nobackup/mathomp4/WavewatchIII-GMAO/bin:/discover/nobackup/mathomp4/WavewatchIII-GMAO/exe:/usr/local/other/pods:/usr/local/other/SLES11.1/R/3.1.0/gcc-4.3.4/lib64/R/bin:.:/home/mathomp4/bin:/home/mathomp4/cvstools:/discover/nobackup/projects/gmao/share/dasilva/opengrads/Contents:/usr/local/other/Htop/1.0/bin:/usr/local/other/SLES11/gnuplot/4.6
> .0/gcc-4.3.2/bin:/usr/local/other/SLES11/xpdf/3.03-gcc-4.3.2/bin:/home/mathomp4/src/pdtoolkit-3.16/x86_64/bin:/discover/nobackup/mathomp4/WavewatchIII-GMAO/bin:/discover/nobackup/mathomp4/WavewatchIII-GMAO/exe:/usr/local/other/pods:/usr/local/other/SLES11.1/R/3.1.0/gcc-4.3.4/lib64/R/bin:/home/mathomp4/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/bin/X11:/usr/X11R6/bin:/usr/games:/opt/kde3/bin:/usr/lib/mit/bin:/usr/lib/mit/sbin:/usr/slurm/bin
> >>> [borg01w063:03815] [[49163,0],0] plm:slurm: reset LD_LIBRARY_PATH:
> /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug/lib:/usr/local/other/SLES11/gcc/4.9.1/lib64:/usr/local/other/SLES11.1/git/
> 1.8.5.2/lib:/usr/local/other/SLES11/svn/1.6.17/lib:/usr/local/other/SLES11/tkcvs/8.2.3/gcc-4.3.2/lib
> >>> srun.slurm: cluster configuration lacks support for cpu binding
> >>> srun.slurm: cluster configuration lacks support for cpu binding
> >>> [borg01w065:15893] mca: base: components_register: registering oob
> components
> >>> [borg01w065:15893] mca: base: components_register: found loaded
> component tcp
> >>> [borg01w065:15893] mca: base: components_register: component tcp
> register function successful
> >>> [borg01w065:15893] mca: base: components_open: opening oob components
> >>> [borg01w065:15893] mca: base: components_open: found loaded component
> tcp
> >>> [borg01w065:15893] mca: base: components_open: component tcp open
> function successful
> >>> [borg01w065:15893] mca:oob:select: checking available component tcp
> >>> [borg01w065:15893] mca:oob:select: Querying component [tcp]
> >>> [borg01w065:15893] oob:tcp: component_available called
> >>> [borg01w065:15893] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
> >>> [borg01w065:15893] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4
> >>> [borg01w065:15893] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4
> >>> [borg01w065:15893] [[49163,0],2] oob:tcp:init adding 10.1.24.65 to our
> list of V4 connections
> >>> [borg01w065:15893] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4
> >>> [borg01w065:15893] [[49163,0],2] oob:tcp:init adding 172.31.1.254 to
> our list of V4 connections
> >>> [borg01w065:15893] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4
> >>> [borg01w065:15893] [[49163,0],2] oob:tcp:init adding 10.12.24.65 to
> our list of V4 connections
> >>> [borg01w065:15893] [[49163,0],2] TCP STARTUP
> >>> [borg01w065:15893] [[49163,0],2] attempting to bind to IPv4 port 0
> >>> [borg01w065:15893] [[49163,0],2] assigned IPv4 port 43456
> >>> [borg01w065:15893] mca:oob:select: Adding component to end
> >>> [borg01w065:15893] mca:oob:select: Found 1 active transports
> >>> [borg01w070:12645] mca: base: components_register: registering oob
> components
> >>> [borg01w070:12645] mca: base: components_register: found loaded
> component tcp
> >>> [borg01w070:12645] mca: base: components_register: component tcp
> register function successful
> >>> [borg01w070:12645] mca: base: components_open: opening oob components
> >>> [borg01w070:12645] mca: base: components_open: found loaded component
> tcp
> >>> [borg01w070:12645] mca: base: components_open: component tcp open
> function successful
> >>> [borg01w070:12645] mca:oob:select: checking available component tcp
> >>> [borg01w070:12645] mca:oob:select: Querying component [tcp]
> >>> [borg01w070:12645] oob:tcp: component_available called
> >>> [borg01w070:12645] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
> >>> [borg01w070:12645] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4
> >>> [borg01w070:12645] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4
> >>> [borg01w070:12645] [[49163,0],4] oob:tcp:init adding 10.1.24.70 to our
> list of V4 connections
> >>> [borg01w070:12645] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4
> >>> [borg01w070:12645] [[49163,0],4] oob:tcp:init adding 172.31.1.254 to
> our list of V4 connections
> >>> [borg01w070:12645] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4
> >>> [borg01w070:12645] [[49163,0],4] oob:tcp:init adding 10.12.24.70 to
> our list of V4 connections
> >>> [borg01w070:12645] [[49163,0],4] TCP STARTUP
> >>> [borg01w070:12645] [[49163,0],4] attempting to bind to IPv4 port 0
> >>> [borg01w070:12645] [[49163,0],4] assigned IPv4 port 53062
> >>> [borg01w070:12645] mca:oob:select: Adding component to end
> >>> [borg01w070:12645] mca:oob:select: Found 1 active transports
> >>> [borg01w064:16565] mca: base: components_register: registering oob
> components
> >>> [borg01w064:16565] mca: base: components_register: found loaded
> component tcp
> >>> [borg01w064:16565] mca: base: components_register: component tcp
> register function successful
> >>> [borg01w071:14879] mca: base: components_register: registering oob
> components
> >>> [borg01w071:14879] mca: base: components_register: found loaded
> component tcp
> >>> [borg01w064:16565] mca: base: components_open: opening oob components
> >>> [borg01w064:16565] mca: base: components_open: found loaded component
> tcp
> >>> [borg01w064:16565] mca: base: components_open: component tcp open
> function successful
> >>> [borg01w064:16565] mca:oob:select: checking available component tcp
> >>> [borg01w064:16565] mca:oob:select: Querying component [tcp]
> >>> [borg01w064:16565] oob:tcp: component_available called
> >>> [borg01w064:16565] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
> >>> [borg01w064:16565] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4
> >>> [borg01w064:16565] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4
> >>> [borg01w064:16565] [[49163,0],1] oob:tcp:init adding 10.1.24.64 to our
> list of V4 connections
> >>> [borg01w064:16565] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4
> >>> [borg01w064:16565] [[49163,0],1] oob:tcp:init adding 172.31.1.254 to
> our list of V4 connections
> >>> [borg01w064:16565] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4
> >>> [borg01w064:16565] [[49163,0],1] oob:tcp:init adding 10.12.24.64 to
> our list of V4 connections
> >>> [borg01w064:16565] [[49163,0],1] TCP STARTUP
> >>> [borg01w064:16565] [[49163,0],1] attempting to bind to IPv4 port 0
> >>> [borg01w064:16565] [[49163,0],1] assigned IPv4 port 43828
> >>> [borg01w064:16565] mca:oob:select: Adding component to end
> >>> [borg01w069:30276] mca: base: components_register: registering oob
> components
> >>> [borg01w069:30276] mca: base: components_register: found loaded
> component tcp
> >>> [borg01w071:14879] mca: base: components_register: component tcp
> register function successful
> >>> [borg01w069:30276] mca: base: components_register: component tcp
> register function successful
> >>> [borg01w071:14879] mca: base: components_open: opening oob components
> >>> [borg01w071:14879] mca: base: components_open: found loaded component
> tcp
> >>> [borg01w071:14879] mca: base: components_open: component tcp open
> function successful
> >>> [borg01w071:14879] mca:oob:select: checking available component tcp
> >>> [borg01w071:14879] mca:oob:select: Querying component [tcp]
> >>> [borg01w071:14879] oob:tcp: component_available called
> >>> [borg01w069:30276] mca: base: components_open: opening oob components
> >>> [borg01w069:30276] mca: base: components_open: found loaded component
> tcp
> >>> [borg01w069:30276] mca: base: components_open: component tcp open
> function successful
> >>> [borg01w071:14879] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
> >>> [borg01w071:14879] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4
> >>> [borg01w071:14879] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4
> >>> [borg01w071:14879] [[49163,0],5] oob:tcp:init adding 10.1.24.71 to our
> list of V4 connections
> >>> [borg01w071:14879] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4
> >>> [borg01w069:30276] mca:oob:select: checking available component tcp
> >>> [borg01w069:30276] mca:oob:select: Querying component [tcp]
> >>> [borg01w069:30276] oob:tcp: component_available called
> >>> [borg01w071:14879] [[49163,0],5] oob:tcp:init adding 172.31.1.254 to
> our list of V4 connections
> >>> [borg01w071:14879] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4
> >>> [borg01w071:14879] [[49163,0],5] oob:tcp:init adding 10.12.24.71 to
> our list of V4 connections
> >>> [borg01w071:14879] [[49163,0],5] TCP STARTUP
> >>> [borg01w069:30276] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
> >>> [borg01w069:30276] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4
> >>> [borg01w069:30276] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4
> >>> [borg01w069:30276] [[49163,0],3] oob:tcp:init adding 10.1.24.69 to our
> list of V4 connections
> >>> [borg01w069:30276] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4
> >>> [borg01w069:30276] [[49163,0],3] oob:tcp:init adding 172.31.1.254 to
> our list of V4 connections
> >>> [borg01w069:30276] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4
> >>> [borg01w069:30276] [[49163,0],3] oob:tcp:init adding 10.12.24.69 to
> our list of V4 connections
> >>> [borg01w069:30276] [[49163,0],3] TCP STARTUP
> >>> [borg01w071:14879] [[49163,0],5] attempting to bind to IPv4 port 0
> >>> [borg01w069:30276] [[49163,0],3] attempting to bind to IPv4 port 0
> >>> [borg01w069:30276] [[49163,0],3] assigned IPv4 port 39299
> >>> [borg01w064:16565] mca:oob:select: Found 1 active transports
> >>> [borg01w069:30276] mca:oob:select: Adding component to end
> >>> [borg01w069:30276] mca:oob:select: Found 1 active transports
> >>> [borg01w071:14879] [[49163,0],5] assigned IPv4 port 56113
> >>> [borg01w071:14879] mca:oob:select: Adding component to end
> >>> [borg01w071:14879] mca:oob:select: Found 1 active transports
> >>> srun.slurm: error: borg01w064: task 0: Exited with exit code 213
> >>> srun.slurm: Terminating job step 2347743.3
> >>> srun.slurm: Job step aborted: Waiting up to 2 seconds for job step to
> finish.
> >>> [borg01w070:12645] [[49163,0],4] ORTE_ERROR_LOG: Bad parameter in file
> base/rml_base_contact.c at line 161
> >>> [borg01w070:12645] [[49163,0],4] ORTE_ERROR_LOG: Bad parameter in file
> routed_binomial.c at line 498
> >>> [borg01w070:12645] [[49163,0],4] ORTE_ERROR_LOG: Bad parameter in file
> base/ess_base_std_orted.c at line 539
> >>> [borg01w065:15893] [[49163,0],2] ORTE_ERROR_LOG: Bad parameter in file
> base/rml_base_contact.c at line 161
> >>> [borg01w065:15893] [[49163,0],2] ORTE_ERROR_LOG: Bad parameter in file
> routed_binomial.c at line 498
> >>> [borg01w065:15893] [[49163,0],2] ORTE_ERROR_LOG: Bad parameter in file
> base/ess_base_std_orted.c at line 539
> >>> slurmd[borg01w065]: *** STEP 2347743.3 KILLED AT 2014-08-31T09:24:17
> WITH SIGNAL 9 ***
> >>> slurmd[borg01w070]: *** STEP 2347743.3 KILLED AT 2014-08-31T09:24:17
> WITH SIGNAL 9 ***
> >>> [borg01w064:16565] [[49163,0],1] ORTE_ERROR_LOG: Bad parameter in file
> base/rml_base_contact.c at line 161
> >>> [borg01w064:16565] [[49163,0],1] ORTE_ERROR_LOG: Bad parameter in file
> routed_binomial.c at line 498
> >>> [borg01w064:16565] [[49163,0],1] ORTE_ERROR_LOG: Bad parameter in file
> base/ess_base_std_orted.c at line 539
> >>> [borg01w069:30276] [[49163,0],3] ORTE_ERROR_LOG: Bad parameter in file
> base/rml_base_contact.c at line 161
> >>> [borg01w069:30276] [[49163,0],3] ORTE_ERROR_LOG: Bad parameter in file
> routed_binomial.c at line 498
> >>> [borg01w069:30276] [[49163,0],3] ORTE_ERROR_LOG: Bad parameter in file
> base/ess_base_std_orted.c at line 539
> >>> slurmd[borg01w069]: *** STEP 2347743.3 KILLED AT 2014-08-31T09:24:17
> WITH SIGNAL 9 ***
> >>> [borg01w071:14879] [[49163,0],5] ORTE_ERROR_LOG: Bad parameter in file
> base/rml_base_contact.c at line 161
> >>> [borg01w071:14879] [[49163,0],5] ORTE_ERROR_LOG: Bad parameter in file
> routed_binomial.c at line 498
> >>> [borg01w071:14879] [[49163,0],5] ORTE_ERROR_LOG: Bad parameter in file
> base/ess_base_std_orted.c at line 539
> >>> slurmd[borg01w071]: *** STEP 2347743.3 KILLED AT 2014-08-31T09:24:17
> WITH SIGNAL 9 ***
> >>> slurmd[borg01w065]: *** STEP 2347743.3 KILLED AT 2014-08-31T09:24:17
> WITH SIGNAL 9 ***
> >>> slurmd[borg01w069]: *** STEP 2347743.3 KILLED AT 2014-08-31T09:24:17
> WITH SIGNAL 9 ***
> >>> slurmd[borg01w070]: *** STEP 2347743.3 KILLED AT 2014-08-31T09:24:17
> WITH SIGNAL 9 ***
> >>> slurmd[borg01w071]: *** STEP 2347743.3 KILLED AT 2014-08-31T09:24:17
> WITH SIGNAL 9 ***
> >>> srun.slurm: error: borg01w069: task 2: Exited with exit code 213
> >>> srun.slurm: error: borg01w065: task 1: Exited with exit code 213
> >>> srun.slurm: error: borg01w071: task 4: Exited with exit code 213
> >>> srun.slurm: error: borg01w070: task 3: Exited with exit code 213
> >>> sh: tcp://10.1.24.63,172.31.1.254,10.12.24.63:41373: No such file or
> directory
> >>> [borg01w063:03815] [[49163,0],0] plm:slurm: primary daemons complete!
> >>> [borg01w063:03815] [[49163,0],0] plm:base:receive stop comm
> >>> [borg01w063:03815] [[49163,0],0] TCP SHUTDOWN
> >>> [borg01w063:03815] mca: base: close: component tcp closed
> >>> [borg01w063:03815] mca: base: close: unloading component tcp
> >>>
> >>>
> >>>
> >>> On Fri, Aug 29, 2014 at 3:18 PM, Ralph Castain <r...@open-mpi.org>
> wrote:
> >>> Rats - I also need "-mca plm_base_verbose 5" on there so I can see the
> cmd line being executed. Can you add it?
> >>>
> >>>
> >>> On Aug 29, 2014, at 11:16 AM, Matt Thompson <fort...@gmail.com> wrote:
> >>>
> >>>> Ralph,
> >>>>
> >>>> Here you go:
> >>>>
> >>>> (1080) $
> /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug/bin/mpirun
> --leave-session-attached --debug-daemons --mca oob_base_verbose 10 -np 8
> ./helloWorld.182-debug.x
> >>>> [borg01x142:29232] mca: base: components_register: registering oob
> components
> >>>> [borg01x142:29232] mca: base: components_register: found loaded
> component tcp
> >>>> [borg01x142:29232] mca: base: components_register: component tcp
> register function successful
> >>>> [borg01x142:29232] mca: base: components_open: opening oob components
> >>>> [borg01x142:29232] mca: base: components_open: found loaded component
> tcp
> >>>> [borg01x142:29232] mca: base: components_open: component tcp open
> function successful
> >>>> [borg01x142:29232] mca:oob:select: checking available component tcp
> >>>> [borg01x142:29232] mca:oob:select: Querying component [tcp]
> >>>> [borg01x142:29232] oob:tcp: component_available called
> >>>> [borg01x142:29232] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
> >>>> [borg01x142:29232] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4
> >>>> [borg01x142:29232] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4
> >>>> [borg01x142:29232] [[52298,0],0] oob:tcp:init adding 10.1.25.142 to
> our list of V4 connections
> >>>> [borg01x142:29232] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4
> >>>> [borg01x142:29232] [[52298,0],0] oob:tcp:init adding 172.31.1.254 to
> our list of V4 connections
> >>>> [borg01x142:29232] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4
> >>>> [borg01x142:29232] [[52298,0],0] oob:tcp:init adding 10.12.25.142 to
> our list of V4 connections
> >>>> [borg01x142:29232] [[52298,0],0] TCP STARTUP
> >>>> [borg01x142:29232] [[52298,0],0] attempting to bind to IPv4 port 0
> >>>> [borg01x142:29232] [[52298,0],0] assigned IPv4 port 41686
> >>>> [borg01x142:29232] mca:oob:select: Adding component to end
> >>>> [borg01x142:29232] mca:oob:select: Found 1 active transports
> >>>> srun.slurm: cluster configuration lacks support for cpu binding
> >>>> srun.slurm: cluster configuration lacks support for cpu binding
> >>>> [borg01x153:01290] mca: base: components_register: registering oob
> components
> >>>> [borg01x153:01290] mca: base: components_register: found loaded
> component tcp
> >>>> [borg01x143:13793] mca: base: components_register: registering oob
> components
> >>>> [borg01x143:13793] mca: base: components_register: found loaded
> component tcp
> >>>> [borg01x153:01290] mca: base: components_register: component tcp
> register function successful
> >>>> [borg01x153:01290] mca: base: components_open: opening oob components
> >>>> [borg01x153:01290] mca: base: components_open: found loaded component
> tcp
> >>>> [borg01x153:01290] mca: base: components_open: component tcp open
> function successful
> >>>> [borg01x153:01290] mca:oob:select: checking available component tcp
> >>>> [borg01x153:01290] mca:oob:select: Querying component [tcp]
> >>>> [borg01x153:01290] oob:tcp: component_available called
> >>>> [borg01x153:01290] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
> >>>> [borg01x153:01290] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4
> >>>> [borg01x153:01290] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4
> >>>> [borg01x153:01290] [[52298,0],4] oob:tcp:init adding 10.1.25.153 to
> our list of V4 connections
> >>>> [borg01x153:01290] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4
> >>>> [borg01x153:01290] [[52298,0],4] oob:tcp:init adding 172.31.1.254 to
> our list of V4 connections
> >>>> [borg01x153:01290] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4
> >>>> [borg01x153:01290] [[52298,0],4] oob:tcp:init adding 10.12.25.153 to
> our list of V4 connections
> >>>> [borg01x153:01290] [[52298,0],4] TCP STARTUP
> >>>> [borg01x153:01290] [[52298,0],4] attempting to bind to IPv4 port 0
> >>>> [borg01x143:13793] mca: base: components_register: component tcp
> register function successful
> >>>> [borg01x153:01290] [[52298,0],4] assigned IPv4 port 38028
> >>>> [borg01x143:13793] mca: base: components_open: opening oob components
> >>>> [borg01x143:13793] mca: base: components_open: found loaded component
> tcp
> >>>> [borg01x143:13793] mca: base: components_open: component tcp open
> function successful
> >>>> [borg01x143:13793] mca:oob:select: checking available component tcp
> >>>> [borg01x143:13793] mca:oob:select: Querying component [tcp]
> >>>> [borg01x143:13793] oob:tcp: component_available called
> >>>> [borg01x143:13793] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
> >>>> [borg01x143:13793] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4
> >>>> [borg01x143:13793] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4
> >>>> [borg01x143:13793] [[52298,0],1] oob:tcp:init adding 10.1.25.143 to
> our list of V4 connections
> >>>> [borg01x143:13793] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4
> >>>> [borg01x143:13793] [[52298,0],1] oob:tcp:init adding 172.31.1.254 to
> our list of V4 connections
> >>>> [borg01x143:13793] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4
> >>>> [borg01x143:13793] [[52298,0],1] oob:tcp:init adding 10.12.25.143 to
> our list of V4 connections
> >>>> [borg01x143:13793] [[52298,0],1] TCP STARTUP
> >>>> [borg01x143:13793] [[52298,0],1] attempting to bind to IPv4 port 0
> >>>> [borg01x153:01290] mca:oob:select: Adding component to end
> >>>> [borg01x153:01290] mca:oob:select: Found 1 active transports
> >>>> [borg01x143:13793] [[52298,0],1] assigned IPv4 port 44719
> >>>> [borg01x143:13793] mca:oob:select: Adding component to end
> >>>> [borg01x143:13793] mca:oob:select: Found 1 active transports
> >>>> [borg01x144:30878] mca: base: components_register: registering oob
> components
> >>>> [borg01x144:30878] mca: base: components_register: found loaded
> component tcp
> >>>> [borg01x144:30878] mca: base: components_register: component tcp
> register function successful
> >>>> [borg01x144:30878] mca: base: components_open: opening oob components
> >>>> [borg01x144:30878] mca: base: components_open: found loaded component
> tcp
> >>>> [borg01x144:30878] mca: base: components_open: component tcp open
> function successful
> >>>> [borg01x144:30878] mca:oob:select: checking available component tcp
> >>>> [borg01x144:30878] mca:oob:select: Querying component [tcp]
> >>>> [borg01x144:30878] oob:tcp: component_available called
> >>>> [borg01x144:30878] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
> >>>> [borg01x144:30878] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4
> >>>> [borg01x144:30878] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4
> >>>> [borg01x144:30878] [[52298,0],2] oob:tcp:init adding 10.1.25.144 to
> our list of V4 connections
> >>>> [borg01x144:30878] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4
> >>>> [borg01x144:30878] [[52298,0],2] oob:tcp:init adding 172.31.1.254 to
> our list of V4 connections
> >>>> [borg01x144:30878] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4
> >>>> [borg01x144:30878] [[52298,0],2] oob:tcp:init adding 10.12.25.144 to
> our list of V4 connections
> >>>> [borg01x144:30878] [[52298,0],2] TCP STARTUP
> >>>> [borg01x144:30878] [[52298,0],2] attempting to bind to IPv4 port 0
> >>>> [borg01x144:30878] [[52298,0],2] assigned IPv4 port 40700
> >>>> [borg01x144:30878] mca:oob:select: Adding component to end
> >>>> [borg01x144:30878] mca:oob:select: Found 1 active transports
> >>>> [borg01x154:01154] mca: base: components_register: registering oob
> components
> >>>> [borg01x154:01154] mca: base: components_register: found loaded
> component tcp
> >>>> [borg01x154:01154] mca: base: components_register: component tcp
> register function successful
> >>>> [borg01x154:01154] mca: base: components_open: opening oob components
> >>>> [borg01x154:01154] mca: base: components_open: found loaded component
> tcp
> >>>> [borg01x154:01154] mca: base: components_open: component tcp open
> function successful
> >>>> [borg01x154:01154] mca:oob:select: checking available component tcp
> >>>> [borg01x154:01154] mca:oob:select: Querying component [tcp]
> >>>> [borg01x154:01154] oob:tcp: component_available called
> >>>> [borg01x154:01154] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
> >>>> [borg01x154:01154] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4
> >>>> [borg01x154:01154] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4
> >>>> [borg01x154:01154] [[52298,0],5] oob:tcp:init adding 10.1.25.154 to
> our list of V4 connections
> >>>> [borg01x154:01154] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4
> >>>> [borg01x154:01154] [[52298,0],5] oob:tcp:init adding 172.31.1.254 to
> our list of V4 connections
> >>>> [borg01x154:01154] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4
> >>>> [borg01x154:01154] [[52298,0],5] oob:tcp:init adding 10.12.25.154 to
> our list of V4 connections
> >>>> [borg01x154:01154] [[52298,0],5] TCP STARTUP
> >>>> [borg01x154:01154] [[52298,0],5] attempting to bind to IPv4 port 0
> >>>> [borg01x154:01154] [[52298,0],5] assigned IPv4 port 41191
> >>>> [borg01x154:01154] mca:oob:select: Adding component to end
> >>>> [borg01x154:01154] mca:oob:select: Found 1 active transports
> >>>> [borg01x145:02419] mca: base: components_register: registering oob
> components
> >>>> [borg01x145:02419] mca: base: components_register: found loaded
> component tcp
> >>>> [borg01x145:02419] mca: base: components_register: component tcp
> register function successful
> >>>> [borg01x145:02419] mca: base: components_open: opening oob components
> >>>> [borg01x145:02419] mca: base: components_open: found loaded component
> tcp
> >>>> [borg01x145:02419] mca: base: components_open: component tcp open
> function successful
> >>>> [borg01x145:02419] mca:oob:select: checking available component tcp
> >>>> [borg01x145:02419] mca:oob:select: Querying component [tcp]
> >>>> [borg01x145:02419] oob:tcp: component_available called
> >>>> [borg01x145:02419] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
> >>>> [borg01x145:02419] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4
> >>>> [borg01x145:02419] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4
> >>>> [borg01x145:02419] [[52298,0],3] oob:tcp:init adding 10.1.25.145 to
> our list of V4 connections
> >>>> [borg01x145:02419] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4
> >>>> [borg01x145:02419] [[52298,0],3] oob:tcp:init adding 172.31.1.254 to
> our list of V4 connections
> >>>> [borg01x145:02419] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4
> >>>> [borg01x145:02419] [[52298,0],3] oob:tcp:init adding 10.12.25.145 to
> our list of V4 connections
> >>>> [borg01x145:02419] [[52298,0],3] TCP STARTUP
> >>>> [borg01x145:02419] [[52298,0],3] attempting to bind to IPv4 port 0
> >>>> [borg01x145:02419] [[52298,0],3] assigned IPv4 port 51079
> >>>> [borg01x145:02419] mca:oob:select: Adding component to end
> >>>> [borg01x145:02419] mca:oob:select: Found 1 active transports
> >>>> [borg01x144:30878] [[52298,0],2] ORTE_ERROR_LOG: Bad parameter in
> file base/rml_base_contact.c at line 161
> >>>> [borg01x144:30878] [[52298,0],2] ORTE_ERROR_LOG: Bad parameter in
> file routed_binomial.c at line 498
> >>>> [borg01x144:30878] [[52298,0],2] ORTE_ERROR_LOG: Bad parameter in
> file base/ess_base_std_orted.c at line 539
> >>>> srun.slurm: error: borg01x143: task 0: Exited with exit code 213
> >>>> srun.slurm: Terminating job step 2332583.24
> >>>> slurmd[borg01x144]: *** STEP 2332583.24 KILLED AT 2014-08-29T13:59:30
> WITH SIGNAL 9 ***
> >>>> srun.slurm: Job step aborted: Waiting up to 2 seconds for job step to
> finish.
> >>>> srun.slurm: error: borg01x153: task 3: Exited with exit code 213
> >>>> [borg01x153:01290] [[52298,0],4] ORTE_ERROR_LOG: Bad parameter in
> file base/rml_base_contact.c at line 161
> >>>> [borg01x153:01290] [[52298,0],4] ORTE_ERROR_LOG: Bad parameter in
> file routed_binomial.c at line 498
> >>>> [borg01x153:01290] [[52298,0],4] ORTE_ERROR_LOG: Bad parameter in
> file base/ess_base_std_orted.c at line 539
> >>>> [borg01x143:13793] [[52298,0],1] ORTE_ERROR_LOG: Bad parameter in
> file base/rml_base_contact.c at line 161
> >>>> [borg01x143:13793] [[52298,0],1] ORTE_ERROR_LOG: Bad parameter in
> file routed_binomial.c at line 498
> >>>> [borg01x143:13793] [[52298,0],1] ORTE_ERROR_LOG: Bad parameter in
> file base/ess_base_std_orted.c at line 539
> >>>> slurmd[borg01x144]: *** STEP 2332583.24 KILLED AT 2014-08-29T13:59:30
> WITH SIGNAL 9 ***
> >>>> srun.slurm: error: borg01x144: task 1: Exited with exit code 213
> >>>> [borg01x154:01154] [[52298,0],5] ORTE_ERROR_LOG: Bad parameter in
> file base/rml_base_contact.c at line 161
> >>>> [borg01x154:01154] [[52298,0],5] ORTE_ERROR_LOG: Bad parameter in
> file routed_binomial.c at line 498
> >>>> [borg01x154:01154] [[52298,0],5] ORTE_ERROR_LOG: Bad parameter in
> file base/ess_base_std_orted.c at line 539
> >>>> slurmd[borg01x154]: *** STEP 2332583.24 KILLED AT 2014-08-29T13:59:30
> WITH SIGNAL 9 ***
> >>>> slurmd[borg01x154]: *** STEP 2332583.24 KILLED AT 2014-08-29T13:59:30
> WITH SIGNAL 9 ***
> >>>> srun.slurm: error: borg01x154: task 4: Exited with exit code 213
> >>>> srun.slurm: error: borg01x145: task 2: Exited with exit code 213
> >>>> [borg01x145:02419] [[52298,0],3] ORTE_ERROR_LOG: Bad parameter in
> file base/rml_base_contact.c at line 161
> >>>> [borg01x145:02419] [[52298,0],3] ORTE_ERROR_LOG: Bad parameter in
> file routed_binomial.c at line 498
> >>>> [borg01x145:02419] [[52298,0],3] ORTE_ERROR_LOG: Bad parameter in
> file base/ess_base_std_orted.c at line 539
> >>>> slurmd[borg01x145]: *** STEP 2332583.24 KILLED AT 2014-08-29T13:59:30
> WITH SIGNAL 9 ***
> >>>> slurmd[borg01x145]: *** STEP 2332583.24 KILLED AT 2014-08-29T13:59:30
> WITH SIGNAL 9 ***
> >>>> sh: tcp://10.1.25.142,172.31.1.254,10.12.25.142:41686: No such file
> or directory
> >>>> [borg01x142:29232] [[52298,0],0] TCP SHUTDOWN
> >>>> [borg01x142:29232] mca: base: close: component tcp closed
> >>>> [borg01x142:29232] mca: base: close: unloading component tcp
> >>>>
> >>>> Note, if I can get the allocation today, I want to try doing all this
> on a single SandyBridge node, rather than on 6. It might make comparing
> various runs a bit easier!
> >>>>
> >>>> Matt
> >>>>
> >>>>
> >>>>
> >>>> On Fri, Aug 29, 2014 at 12:42 PM, Ralph Castain <r...@open-mpi.org>
> wrote:
> >>>> Okay, something quite weird is happening here. I can't replicate
> using the 1.8.2 release tarball on a slurm machine, so my guess is that
> something else is going on here.
> >>>>
> >>>> Could you please rebuild the 1.8.2 code with --enable-debug on the
> configure line (assuming you haven't already done so), and then rerun that
> version as before but adding "--mca oob_base_verbose 10" to the cmd line?
> >>>>
> >>>>
> >>>> On Aug 29, 2014, at 4:22 AM, Matt Thompson <fort...@gmail.com> wrote:
> >>>>
> >>>>> Ralph,
> >>>>>
> >>>>> For 1.8.2rc4 I get:
> >>>>>
> >>>>> (1003) $
> /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2rc4/bin/mpirun
> --leave-session-attached --debug-daemons -np 8 ./helloWorld.182.x
> >>>>> srun.slurm: cluster configuration lacks support for cpu binding
> >>>>> srun.slurm: cluster configuration lacks support for cpu binding
> >>>>> Daemon [[47143,0],5] checking in as pid 10990 on host borg01x154
> >>>>> [borg01x154:10990] [[47143,0],5] orted: up and running - waiting for
> commands!
> >>>>> Daemon [[47143,0],1] checking in as pid 23473 on host borg01x143
> >>>>> Daemon [[47143,0],2] checking in as pid 8250 on host borg01x144
> >>>>> [borg01x144:08250] [[47143,0],2] orted: up and running - waiting for
> commands!
> >>>>> [borg01x143:23473] [[47143,0],1] orted: up and running - waiting for
> commands!
> >>>>> Daemon [[47143,0],3] checking in as pid 12320 on host borg01x145
> >>>>> Daemon [[47143,0],4] checking in as pid 10902 on host borg01x153
> >>>>> [borg01x153:10902] [[47143,0],4] orted: up and running - waiting for
> commands!
> >>>>> [borg01x145:12320] [[47143,0],3] orted: up and running - waiting for
> commands!
> >>>>> [borg01x142:01629] [[47143,0],0] orted_cmd: received add_local_procs
> >>>>> [borg01x144:08250] [[47143,0],2] orted_cmd: received add_local_procs
> >>>>> [borg01x153:10902] [[47143,0],4] orted_cmd: received add_local_procs
> >>>>> [borg01x143:23473] [[47143,0],1] orted_cmd: received add_local_procs
> >>>>> [borg01x145:12320] [[47143,0],3] orted_cmd: received add_local_procs
> >>>>> [borg01x154:10990] [[47143,0],5] orted_cmd: received add_local_procs
> >>>>> [borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap
> from local proc [[47143,1],0]
> >>>>> [borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap
> from local proc [[47143,1],2]
> >>>>> [borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap
> from local proc [[47143,1],3]
> >>>>> [borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap
> from local proc [[47143,1],1]
> >>>>> [borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap
> from local proc [[47143,1],5]
> >>>>> [borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap
> from local proc [[47143,1],4]
> >>>>> [borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap
> from local proc [[47143,1],6]
> >>>>> [borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap
> from local proc [[47143,1],7]
> >>>>>   MPIR_being_debugged = 0
> >>>>>   MPIR_debug_state = 1
> >>>>>   MPIR_partial_attach_ok = 1
> >>>>>   MPIR_i_am_starter = 0
> >>>>>   MPIR_forward_output = 0
> >>>>>   MPIR_proctable_size = 8
> >>>>>   MPIR_proctable:
> >>>>>     (i, host, exe, pid) = (0, borg01x142,
> /home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1647)
> >>>>>     (i, host, exe, pid) = (1, borg01x142,
> /home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1648)
> >>>>>     (i, host, exe, pid) = (2, borg01x142,
> /home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1650)
> >>>>>     (i, host, exe, pid) = (3, borg01x142,
> /home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1652)
> >>>>>     (i, host, exe, pid) = (4, borg01x142,
> /home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1654)
> >>>>>     (i, host, exe, pid) = (5, borg01x142,
> /home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1656)
> >>>>>     (i, host, exe, pid) = (6, borg01x142,
> /home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1658)
> >>>>>     (i, host, exe, pid) = (7, borg01x142,
> /home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1660)
> >>>>> MPIR_executable_path: NULL
> >>>>> MPIR_server_arguments: NULL
> >>>>> [borg01x142:01629] [[47143,0],0] orted_cmd: received
> message_local_procs
> >>>>> [borg01x144:08250] [[47143,0],2] orted_cmd: received
> message_local_procs
> >>>>> [borg01x143:23473] [[47143,0],1] orted_cmd: received
> message_local_procs
> >>>>> [borg01x153:10902] [[47143,0],4] orted_cmd: received
> message_local_procs
> >>>>> [borg01x154:10990] [[47143,0],5] orted_cmd: received
> message_local_procs
> >>>>> [borg01x145:12320] [[47143,0],3] orted_cmd: received
> message_local_procs
> >>>>> [borg01x142:01629] [[47143,0],0] orted_cmd: received
> message_local_procs
> >>>>> [borg01x143:23473] [[47143,0],1] orted_cmd: received
> message_local_procs
> >>>>> [borg01x144:08250] [[47143,0],2] orted_cmd: received
> message_local_procs
> >>>>> [borg01x153:10902] [[47143,0],4] orted_cmd: received
> message_local_procs
> >>>>> [borg01x145:12320] [[47143,0],3] orted_cmd: received
> message_local_procs
> >>>>> Process    2 of    8 is on borg01x142
> >>>>> Process    5 of    8 is on borg01x142
> >>>>> Process    4 of    8 is on borg01x142
> >>>>> Process    1 of    8 is on borg01x142
> >>>>> Process    0 of    8 is on borg01x142
> >>>>> Process    3 of    8 is on borg01x142
> >>>>> Process    6 of    8 is on borg01x142
> >>>>> Process    7 of    8 is on borg01x142
> >>>>> [borg01x154:10990] [[47143,0],5] orted_cmd: received
> message_local_procs
> >>>>> [borg01x142:01629] [[47143,0],0] orted_cmd: received
> message_local_procs
> >>>>> [borg01x144:08250] [[47143,0],2] orted_cmd: received
> message_local_procs
> >>>>> [borg01x143:23473] [[47143,0],1] orted_cmd: received
> message_local_procs
> >>>>> [borg01x153:10902] [[47143,0],4] orted_cmd: received
> message_local_procs
> >>>>> [borg01x154:10990] [[47143,0],5] orted_cmd: received
> message_local_procs
> >>>>> [borg01x145:12320] [[47143,0],3] orted_cmd: received
> message_local_procs
> >>>>> [borg01x142:01629] [[47143,0],0] orted_recv: received sync from
> local proc [[47143,1],2]
> >>>>> [borg01x142:01629] [[47143,0],0] orted_recv: received sync from
> local proc [[47143,1],1]
> >>>>> [borg01x142:01629] [[47143,0],0] orted_recv: received sync from
> local proc [[47143,1],3]
> >>>>> [borg01x142:01629] [[47143,0],0] orted_recv: received sync from
> local proc [[47143,1],0]
> >>>>> [borg01x142:01629] [[47143,0],0] orted_recv: received sync from
> local proc [[47143,1],4]
> >>>>> [borg01x142:01629] [[47143,0],0] orted_recv: received sync from
> local proc [[47143,1],6]
> >>>>> [borg01x142:01629] [[47143,0],0] orted_recv: received sync from
> local proc [[47143,1],5]
> >>>>> [borg01x142:01629] [[47143,0],0] orted_recv: received sync from
> local proc [[47143,1],7]
> >>>>> [borg01x142:01629] [[47143,0],0] orted_cmd: received exit cmd
> >>>>> [borg01x144:08250] [[47143,0],2] orted_cmd: received exit cmd
> >>>>> [borg01x144:08250] [[47143,0],2] orted_cmd: all routes and children
> gone - exiting
> >>>>> [borg01x153:10902] [[47143,0],4] orted_cmd: received exit cmd
> >>>>> [borg01x153:10902] [[47143,0],4] orted_cmd: all routes and children
> gone - exiting
> >>>>> [borg01x143:23473] [[47143,0],1] orted_cmd: received exit cmd
> >>>>> [borg01x154:10990] [[47143,0],5] orted_cmd: received exit cmd
> >>>>> [borg01x154:10990] [[47143,0],5] orted_cmd: all routes and children
> gone - exiting
> >>>>> [borg01x145:12320] [[47143,0],3] orted_cmd: received exit cmd
> >>>>> [borg01x145:12320] [[47143,0],3] orted_cmd: all routes and children
> gone - exiting
> >>>>>
> >>>>> Using the 1.8.2 mpirun:
> >>>>>
> >>>>> (1004) $
> /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2/bin/mpirun
> --leave-session-attached --debug-daemons -np 8 ./helloWorld.182.x
> >>>>> srun.slurm: cluster configuration lacks support for cpu binding
> >>>>> srun.slurm: cluster configuration lacks support for cpu binding
> >>>>> [borg01x143:23494] [[47330,0],1] ORTE_ERROR_LOG: Bad parameter in
> file base/rml_base_contact.c at line 161
> >>>>> [borg01x143:23494] [[47330,0],1] ORTE_ERROR_LOG: Bad parameter in
> file routed_binomial.c at line 498
> >>>>> [borg01x143:23494] [[47330,0],1] ORTE_ERROR_LOG: Bad parameter in
> file base/ess_base_std_orted.c at line 539
> >>>>> srun.slurm: error: borg01x143: task 0: Exited with exit code 213
> >>>>> srun.slurm: Terminating job step 2332583.4
> >>>>> [borg01x153:10915] [[47330,0],4] ORTE_ERROR_LOG: Bad parameter in
> file base/rml_base_contact.c at line 161
> >>>>> [borg01x153:10915] [[47330,0],4] ORTE_ERROR_LOG: Bad parameter in
> file routed_binomial.c at line 498
> >>>>> [borg01x153:10915] [[47330,0],4] ORTE_ERROR_LOG: Bad parameter in
> file base/ess_base_std_orted.c at line 539
> >>>>> [borg01x144:08263] [[47330,0],2] ORTE_ERROR_LOG: Bad parameter in
> file base/rml_base_contact.c at line 161
> >>>>> [borg01x144:08263] [[47330,0],2] ORTE_ERROR_LOG: Bad parameter in
> file routed_binomial.c at line 498
> >>>>> [borg01x144:08263] [[47330,0],2] ORTE_ERROR_LOG: Bad parameter in
> file base/ess_base_std_orted.c at line 539
> >>>>> srun.slurm: Job step aborted: Waiting up to 2 seconds for job step
> to finish.
> >>>>> slurmd[borg01x145]: *** STEP 2332583.4 KILLED AT 2014-08-29T07:16:20
> WITH SIGNAL 9 ***
> >>>>> slurmd[borg01x154]: *** STEP 2332583.4 KILLED AT 2014-08-29T07:16:20
> WITH SIGNAL 9 ***
> >>>>> slurmd[borg01x153]: *** STEP 2332583.4 KILLED AT 2014-08-29T07:16:20
> WITH SIGNAL 9 ***
> >>>>> slurmd[borg01x153]: *** STEP 2332583.4 KILLED AT 2014-08-29T07:16:20
> WITH SIGNAL 9 ***
> >>>>> srun.slurm: error: borg01x144: task 1: Exited with exit code 213
> >>>>> slurmd[borg01x144]: *** STEP 2332583.4 KILLED AT 2014-08-29T07:16:20
> WITH SIGNAL 9 ***
> >>>>> slurmd[borg01x144]: *** STEP 2332583.4 KILLED AT 2014-08-29T07:16:20
> WITH SIGNAL 9 ***
> >>>>> srun.slurm: error: borg01x153: task 3: Exited with exit code 213
> >>>>> slurmd[borg01x154]: *** STEP 2332583.4 KILLED AT 2014-08-29T07:16:20
> WITH SIGNAL 9 ***
> >>>>> slurmd[borg01x145]: *** STEP 2332583.4 KILLED AT 2014-08-29T07:16:20
> WITH SIGNAL 9 ***
> >>>>> srun.slurm: error: borg01x154: task 4: Killed
> >>>>> srun.slurm: error: borg01x145: task 2: Killed
> >>>>> sh: tcp://10.1.25.142,172.31.1.254,10.12.25.142:34169: No such file
> or directory
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Thu, Aug 28, 2014 at 7:17 PM, Ralph Castain <r...@open-mpi.org>
> wrote:
> >>>>> I'm unaware of any changes to the Slurm integration between rc4 and
> final release. It sounds like this might be something else going on - try
> adding "--leave-session-attached --debug-daemons" to your 1.8.2 command
> line and let's see if any errors get reported.
> >>>>>
> >>>>>
> >>>>> On Aug 28, 2014, at 12:20 PM, Matt Thompson <fort...@gmail.com>
> wrote:
> >>>>>
> >>>>>> Open MPI List,
> >>>>>>
> >>>>>> I recently encountered an odd bug with Open MPI 1.8.1 and GCC 4.9.1
> on our cluster (reported on this list), and decided to try it with 1.8.2.
> However, we seem to be having an issue with Open MPI 1.8.2 and SLURM. Even
> weirder, Open MPI 1.8.2rc4 doesn't show the bug. And the bug is: I get no
> stdout with Open MPI 1.8.2. That is, HelloWorld doesn't work.
> >>>>>>
> >>>>>> To wit, our sysadmin has two tarballs:
> >>>>>>
> >>>>>> (1441) $ sha1sum openmpi-1.8.2rc4.tar.bz2
> >>>>>> 7e7496913c949451f546f22a1a159df25f8bb683  openmpi-1.8.2rc4.tar.bz2
> >>>>>> (1442) $ sha1sum openmpi-1.8.2.tar.gz
> >>>>>> cf2b1e45575896f63367406c6c50574699d8b2e1  openmpi-1.8.2.tar.gz
> >>>>>>
> >>>>>> I then built each with a script, following the method our sysadmin
> usually uses:
> >>>>>>
> >>>>>> #!/bin/sh
> >>>>>> set -x
> >>>>>> export
> PREFIX=/discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2
> >>>>>> export
> LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/nlocal/slurm/2.6.3/lib64
> >>>>>> build() {
> >>>>>>   echo `pwd`
> >>>>>>   ./configure --with-slurm --disable-wrapper-rpath --enable-shared
> --enable-mca-no-build=btl-usnic \
> >>>>>>       CC=gcc CXX=g++ F77=gfortran FC=gfortran \
> >>>>>>       CFLAGS="-mtune=generic -fPIC -m64" CXXFLAGS="-mtune=generic
> -fPIC -m64" FFLAGS="-mtune=generic -fPIC -m64" \
> >>>>>>       F77FLAGS="-mtune=generic -fPIC -m64" FCFLAGS="-mtune=generic
> -fPIC -m64" F90FLAGS="-mtune=generic -fPIC -m64" \
> >>>>>>       LDFLAGS="-L/usr/nlocal/slurm/2.6.3/lib64"
> CPPFLAGS="-I/usr/nlocal/slurm/2.6.3/include" LIBS="-lpciaccess" \
> >>>>>>      --prefix=${PREFIX} 2>&1 | tee configure.1.8.2.log
> >>>>>>   make 2>&1 | tee make.1.8.2.log
> >>>>>>   make check 2>&1 | tee makecheck.1.8.2.log
> >>>>>>   make install 2>&1 | tee makeinstall.1.8.2.log
> >>>>>> }
> >>>>>> echo "calling build"
> >>>>>> build
> >>>>>> echo "exiting"
> >>>>>>
> >>>>>> The only difference between the two is '1.8.2' or '1.8.2rc4' in the
> PREFIX and log file tees.  Now, let us test. First, I grab some nodes with
> slurm:
> >>>>>>
> >>>>>> $ salloc --nodes=6 --ntasks-per-node=16 --constraint=sand
> --time=09:00:00 --account=g0620 --mail-type=BEGIN
> >>>>>>
> >>>>>> Once I get my nodes, I run with 1.8.2rc4:
> >>>>>>
> >>>>>> (1142) $
> /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2rc4/bin/mpifort -o
> helloWorld.182rc4.x helloWorld.F90
> >>>>>> (1143) $
> /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2rc4/bin/mpirun -np 8
> ./helloWorld.182rc4.x
> >>>>>> Process    0 of    8 is on borg01w044
> >>>>>> Process    5 of    8 is on borg01w044
> >>>>>> Process    3 of    8 is on borg01w044
> >>>>>> Process    7 of    8 is on borg01w044
> >>>>>> Process    1 of    8 is on borg01w044
> >>>>>> Process    2 of    8 is on borg01w044
> >>>>>> Process    4 of    8 is on borg01w044
> >>>>>> Process    6 of    8 is on borg01w044
> >>>>>>
> >>>>>> Now 1.8.2:
> >>>>>>
> >>>>>> (1144) $
> /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2/bin/mpifort -o
> helloWorld.182.x helloWorld.F90
> >>>>>> (1145) $
> /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2/bin/mpirun -np 8
> ./helloWorld.182.x
> >>>>>> (1146) $
> >>>>>>
> >>>>>> No output at all. But, if I take the helloWorld.x from 1.8.2 and
> run it with 1.8.2rc4's mpirun:
> >>>>>>
> >>>>>> (1146) $
> /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2rc4/bin/mpirun -np 8
> ./helloWorld.182.x
> >>>>>> Process    5 of    8 is on borg01w044
> >>>>>> Process    7 of    8 is on borg01w044
> >>>>>> Process    2 of    8 is on borg01w044
> >>>>>> Process    4 of    8 is on borg01w044
> >>>>>> Process    1 of    8 is on borg01w044
> >>>>>> Process    3 of    8 is on borg01w044
> >>>>>> Process    6 of    8 is on borg01w044
> >>>>>> Process    0 of    8 is on borg01w044
> >>>>>>
> >>>>>> So... any idea what is happening here? There did seem to be a few
> SLURM-related changes between the two tarballs involving /dev/null, but
> deciphering them is a bit above me.
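> >>>>>>
> >>>>>> (For anyone who wants to look, those changes should show up with a
> >>>>>> recursive diff of the two unpacked tarballs, along the lines of the
> >>>>>> sketch below; my guess is that the SLURM launcher code lives under
> >>>>>> orte/mca/plm/slurm, but a wider diff would catch everything regardless.
> >>>>>> This is a sketch, not something pasted from my terminal.)
> >>>>>>
> >>>>>> $ diff -ru openmpi-1.8.2rc4/orte/mca/plm/slurm openmpi-1.8.2/orte/mca/plm/slurm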
> >>>>>>
> >>>>>> You can find the ompi_info, build, make, config, etc. logs at these
> links (they are ~300 kB, which is over the mailing list limit according to
> the Open MPI web page):
> >>>>>>
> >>>>>>
> https://dl.dropboxusercontent.com/u/61696/OMPI-1.8.2rc4-Output.tar.bz2
> >>>>>> https://dl.dropboxusercontent.com/u/61696/OMPI-1.8.2-Output.tar.bz2
> >>>>>>
> >>>>>> Thank you for any help and please let me know if you need more
> information,
> >>>>>> Matt
> >>>>>>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/



-- 
"And, isn't sanity really just a one-trick pony anyway? I mean all you
 get is one trick: rational thinking. But when you're good and crazy,
 oooh, oooh, oooh, the sky is the limit!" -- The Tick
