I would talk to the slurm folks about it - I don't know anything about the internals of HP-MPI, but I do know the relevant OMPI internals. OMPI doesn't do anything with respect to the envars. We just use "srun -hostlist <fff>" to launch the daemons. Each daemon subsequently gets a message telling it what local procs to run, and then fork/exec's those procs. The environment set for those procs is a copy of that given to the daemon, including any and all slurm values. So whatever slurm sets, your procs get.
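As a concrete (if simplified) illustration of that launch model -- a hypothetical sketch, not the actual ORTE source, with "./a.out" standing in for whatever binary the daemon is told to run -- a daemon that fork/execs its local procs passes its own environment through untouched:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/wait.h>

    int main(void)
    {
        pid_t pid = fork();
        if (pid == 0) {
            /* Child: fork() hands the child an exact copy of the parent's
             * environment, and execv() preserves it, so every SLURM_* value
             * the daemon received from srun flows through unmodified. */
            char *child_argv[] = { "./a.out", NULL };
            execv(child_argv[0], child_argv);
            perror("execv");   /* reached only if the exec fails */
            exit(1);
        }
        waitpid(pid, NULL, 0); /* daemon waits for its local proc */
        return 0;
    }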
My guess is that HP-MPI is doing something with the envars to create the difference.

As for running OMPI procs directly from srun: the slurm folks put out a faq (or its equivalent) on it, I believe. I don't recall the details (even though I wrote the integration...). If you google our user and/or devel mailing lists, though, you'll see threads discussing it. Look for "slurmd" in the text - that's the ORTE integration module for that feature.

On Thu, Feb 24, 2011 at 8:55 AM, Henderson, Brent <brent.hender...@hp.com> wrote:

> I'm running OpenMPI v1.4.3 and slurm v2.2.1. I built both with the default configuration except setting the prefix. The tests were run on the exact same nodes (I only have two).
>
> When I run the test you outline below, I am still missing a bunch of env variables with OpenMPI. I ran the extra test of using HP-MPI and they are all present as with the srun invocation. I don't know if this is my slurm setup or not, but I find this really weird. If anyone knows the magic to make the fix that Ralph is referring to, I'd appreciate a pointer.
>
> My guess was that there is a subtle way that the launch differs between the two products. But, since it works for Jeff, maybe there really is a slurm option that I need to compile in or set to make this work the way I want. It is not as simple as HP-MPI moving the environment variables itself, as some of the numbers will change per process created on the remote nodes.
>
> Thanks,
>
> Brent
>
> [brent@node2 mpi]$ salloc -N 2
> salloc: Granted job allocation 29
> [brent@node2 mpi]$ srun env | egrep ^SLURM_ | head
> SLURM_NODELIST=node[1-2]
> SLURM_NNODES=2
> SLURM_JOBID=29
> SLURM_TASKS_PER_NODE=1(x2)
> SLURM_JOB_ID=29
> SLURM_NODELIST=node[1-2]
> SLURM_NNODES=2
> SLURM_JOBID=29
> SLURM_TASKS_PER_NODE=1(x2)
> SLURM_JOB_ID=29
> [brent@node2 mpi]$ srun env | egrep ^SLURM_ | wc -l
> 66
> [brent@node2 mpi]$ srun env | egrep ^SLURM_ | sort > srun.out
> [brent@node2 mpi]$ which mpirun
> ~/bin/openmpi143/bin/mpirun
> [brent@node2 mpi]$ mpirun -np 2 --bynode env | egrep ^SLURM_ | head
> SLURM_NODELIST=node[1-2]
> SLURM_NNODES=2
> SLURM_JOBID=29
> SLURM_TASKS_PER_NODE=8(x2)
> SLURM_JOB_ID=29
> SLURM_SUBMIT_DIR=/mnt/node1/home/brent/src/mpi
> SLURM_JOB_NODELIST=node[1-2]
> SLURM_JOB_CPUS_PER_NODE=8(x2)
> SLURM_JOB_NUM_NODES=2
> SLURM_NODELIST=node[1-2]
> [brent@node2 mpi]$ which mpirun
> ~/bin/openmpi143/bin/mpirun
> [brent@node2 mpi]$ mpirun -np 2 --bynode env | egrep ^SLURM_ | wc -l
> 42          <-- note, not 66 as above!
> [brent@node2 mpi]$ mpirun -np 2 --bynode env | egrep ^SLURM_ | sort > mpirun.out
> [brent@node2 mpi]$ diff srun.out mpirun.out
> 2d1
> < SLURM_CHECKPOINT_IMAGE_DIR=/mnt/node1/home/brent/src/mpi
> 4,5d2
> < SLURM_CPUS_ON_NODE=8
> < SLURM_CPUS_PER_TASK=1
> 8d4
> < SLURM_DISTRIBUTION=cyclic
> 10d5
> < SLURM_GTIDS=1
> 22,23d16
> < SLURM_LAUNCH_NODE_IPADDR=10.0.205.134
> < SLURM_LOCALID=0
> 25c18
> < SLURM_NNODES=2
> ---
> > SLURM_NNODES=1
> 28d20
> < SLURM_NODEID=1
> 31,35c23,24
> < SLURM_NPROCS=2
> < SLURM_NPROCS=2
> < SLURM_NTASKS=2
> < SLURM_NTASKS=2
> < SLURM_PRIO_PROCESS=0
> ---
> > SLURM_NPROCS=1
> > SLURM_NTASKS=1
> 38d26
> < SLURM_PROCID=1
> 40,56c28,35
> < SLURM_SRUN_COMM_HOST=10.0.205.134
> < SLURM_SRUN_COMM_PORT=43247
> < SLURM_SRUN_COMM_PORT=43247
> < SLURM_STEP_ID=2
> < SLURM_STEP_ID=2
> < SLURM_STEPID=2
> < SLURM_STEPID=2
> < SLURM_STEP_LAUNCHER_PORT=43247
> < SLURM_STEP_LAUNCHER_PORT=43247
> < SLURM_STEP_NODELIST=node[1-2]
> < SLURM_STEP_NODELIST=node[1-2]
> < SLURM_STEP_NUM_NODES=2
> < SLURM_STEP_NUM_NODES=2
> < SLURM_STEP_NUM_TASKS=2
> < SLURM_STEP_NUM_TASKS=2
> < SLURM_STEP_TASKS_PER_NODE=1(x2)
> < SLURM_STEP_TASKS_PER_NODE=1(x2)
> ---
> > SLURM_SRUN_COMM_PORT=45154
> > SLURM_STEP_ID=5
> > SLURM_STEPID=5
> > SLURM_STEP_LAUNCHER_PORT=45154
> > SLURM_STEP_NODELIST=node1
> > SLURM_STEP_NUM_NODES=1
> > SLURM_STEP_NUM_TASKS=1
> > SLURM_STEP_TASKS_PER_NODE=1
> 59,62c38,40
> < SLURM_TASK_PID=1381
> < SLURM_TASK_PID=2288
> < SLURM_TASKS_PER_NODE=1(x2)
> < SLURM_TASKS_PER_NODE=1(x2)
> ---
> > SLURM_TASK_PID=1429
> > SLURM_TASKS_PER_NODE=1
> > SLURM_TASKS_PER_NODE=8(x2)
> 64,65d41
> < SLURM_TOPOLOGY_ADDR=node2
> < SLURM_TOPOLOGY_ADDR_PATTERN=node
> [brent@node2 mpi]$
> [brent@node2 mpi]$ /opt/hpmpi/bin/mpirun -srun -n 2 -N 2 env | egrep ^SLURM_ | sort > hpmpi.out
> [brent@node2 mpi]$ diff srun.out hpmpi.out
> 20a21,22
> > SLURM_KILL_BAD_EXIT=1
> > SLURM_KILL_BAD_EXIT=1
> 41,48c43,50
> < SLURM_SRUN_COMM_PORT=43247
> < SLURM_SRUN_COMM_PORT=43247
> < SLURM_STEP_ID=2
> < SLURM_STEP_ID=2
> < SLURM_STEPID=2
> < SLURM_STEPID=2
> < SLURM_STEP_LAUNCHER_PORT=43247
> < SLURM_STEP_LAUNCHER_PORT=43247
> ---
> > SLURM_SRUN_COMM_PORT=33347
> > SLURM_SRUN_COMM_PORT=33347
> > SLURM_STEP_ID=8
> > SLURM_STEP_ID=8
> > SLURM_STEPID=8
> > SLURM_STEPID=8
> > SLURM_STEP_LAUNCHER_PORT=33347
> > SLURM_STEP_LAUNCHER_PORT=33347
> 59,60c61,62
> < SLURM_TASK_PID=1381
> < SLURM_TASK_PID=2288
> ---
> > SLURM_TASK_PID=1592
> > SLURM_TASK_PID=2590
> [brent@node2 mpi]$
> [brent@node2 mpi]$ grep SLURM_PROCID srun.out
> SLURM_PROCID=0
> SLURM_PROCID=1
> [brent@node2 mpi]$ grep SLURM_PROCID mpirun.out
> SLURM_PROCID=0
> [brent@node2 mpi]$ grep SLURM_PROCID hpmpi.out
> SLURM_PROCID=0
> SLURM_PROCID=1
> [brent@node2 mpi]$
>
> > -----Original Message-----
> > From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff Squyres
> > Sent: Thursday, February 24, 2011 9:31 AM
> > To: Open MPI Users
> > Subject: Re: [OMPI users] SLURM environment variables at runtime
> >
> > The weird thing is that when running his test, he saw different results with HP MPI vs. Open MPI.
> >
> > What his test didn't say was whether those were the same exact nodes or not. It would be good to repeat my experiment with the same exact nodes (e.g., inside one SLURM salloc job, or use the -w param to specify the same nodes for salloc for OMPI and srun for HP MPI).
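(Concretely, Jeff's suggestion amounts to something like "salloc -N 2 -w node[1-2]" -- the -w/--nodelist option pins the allocation to explicitly named nodes, so the Open MPI and HP-MPI runs can be forced onto the same pair of machines. The node names here are placeholders.)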
> >
> > On Feb 24, 2011, at 10:02 AM, Ralph Castain wrote:
> >
> > > Like I said, this isn't an OMPI problem. You have your slurm configured to pass certain envars to the remote nodes, and Brent doesn't. It truly is just that simple.
> > >
> > > I've seen this before with other slurm installations. Which envars get set on the backend is configurable, that's all.
> > >
> > > Has nothing to do with OMPI.
> > >
> > > On Thu, Feb 24, 2011 at 7:18 AM, Jeff Squyres <jsquy...@cisco.com> wrote:
> > > I'm afraid I don't see the problem. Let's get 4 nodes from slurm:
> > >
> > > $ salloc -N 4
> > >
> > > Now let's run env and see what SLURM_ env variables we see:
> > >
> > > $ srun env | egrep ^SLURM_ | head
> > > SLURM_JOB_ID=95523
> > > SLURM_JOB_NUM_NODES=4
> > > SLURM_JOB_NODELIST=svbu-mpi[001-004]
> > > SLURM_JOB_CPUS_PER_NODE=4(x4)
> > > SLURM_JOBID=95523
> > > SLURM_NNODES=4
> > > SLURM_NODELIST=svbu-mpi[001-004]
> > > SLURM_TASKS_PER_NODE=1(x4)
> > > SLURM_PRIO_PROCESS=0
> > > SLURM_UMASK=0002
> > > $ srun env | egrep ^SLURM_ | wc -l
> > > 144
> > >
> > > Good -- there's 144 of them. Let's save them to a file for comparison, later.
> > >
> > > $ srun env | egrep ^SLURM_ | sort > srun.out
> > >
> > > Now let's repeat the process with mpirun. Note that mpirun defaults to running one process per core (vs. srun's default of running one per node). So let's tone mpirun down to use one process per node and look for the SLURM_ env variables.
> > >
> > > $ mpirun -np 4 --bynode env | egrep ^SLURM_ | head
> > > SLURM_JOB_ID=95523
> > > SLURM_JOB_NUM_NODES=4
> > > SLURM_JOB_NODELIST=svbu-mpi[001-004]
> > > SLURM_JOB_ID=95523
> > > SLURM_JOB_NUM_NODES=4
> > > SLURM_JOB_CPUS_PER_NODE=4(x4)
> > > SLURM_JOBID=95523
> > > SLURM_NNODES=4
> > > SLURM_NODELIST=svbu-mpi[001-004]
> > > SLURM_TASKS_PER_NODE=1(x4)
> > > $ mpirun -np 4 --bynode env | egrep ^SLURM_ | wc -l
> > > 144
> > >
> > > Good -- we also got 144. Save them to a file.
> > >
> > > $ mpirun -np 4 --bynode env | egrep ^SLURM_ | sort > mpirun.out
> > >
> > > Now let's compare what we got from srun and from mpirun:
> > >
> > > $ diff srun.out mpirun.out
> > > 93,108c93,108
> > > < SLURM_SRUN_COMM_PORT=33571
> > > < SLURM_SRUN_COMM_PORT=33571
> > > < SLURM_SRUN_COMM_PORT=33571
> > > < SLURM_SRUN_COMM_PORT=33571
> > > < SLURM_STEP_ID=15
> > > < SLURM_STEP_ID=15
> > > < SLURM_STEP_ID=15
> > > < SLURM_STEP_ID=15
> > > < SLURM_STEPID=15
> > > < SLURM_STEPID=15
> > > < SLURM_STEPID=15
> > > < SLURM_STEPID=15
> > > < SLURM_STEP_LAUNCHER_PORT=33571
> > > < SLURM_STEP_LAUNCHER_PORT=33571
> > > < SLURM_STEP_LAUNCHER_PORT=33571
> > > < SLURM_STEP_LAUNCHER_PORT=33571
> > > ---
> > > > SLURM_SRUN_COMM_PORT=54184
> > > > SLURM_SRUN_COMM_PORT=54184
> > > > SLURM_SRUN_COMM_PORT=54184
> > > > SLURM_SRUN_COMM_PORT=54184
> > > > SLURM_STEP_ID=18
> > > > SLURM_STEP_ID=18
> > > > SLURM_STEP_ID=18
> > > > SLURM_STEP_ID=18
> > > > SLURM_STEPID=18
> > > > SLURM_STEPID=18
> > > > SLURM_STEPID=18
> > > > SLURM_STEPID=18
> > > > SLURM_STEP_LAUNCHER_PORT=54184
> > > > SLURM_STEP_LAUNCHER_PORT=54184
> > > > SLURM_STEP_LAUNCHER_PORT=54184
> > > > SLURM_STEP_LAUNCHER_PORT=54184
> > > 125,128c125,128
> > > < SLURM_TASK_PID=3899
> > > < SLURM_TASK_PID=3907
> > > < SLURM_TASK_PID=3908
> > > < SLURM_TASK_PID=3997
> > > ---
> > > > SLURM_TASK_PID=3924
> > > > SLURM_TASK_PID=3933
> > > > SLURM_TASK_PID=3934
> > > > SLURM_TASK_PID=4039
> > > $
> > >
> > > They're identical except for per-step values (ports, PIDs, etc.) -- these differences are expected.
> > >
> > > What version of OMPI are you running? What happens if you repeat this experiment?
> > >
> > > I would find it very strange if Open MPI's mpirun is filtering some SLURM env variables to some processes and not to all -- your output shows disparate output between the different processes. That's just plain weird.
> > >
> > > On Feb 23, 2011, at 12:05 PM, Henderson, Brent wrote:
> > >
> > > > SLURM seems to be doing this in the case of a regular srun:
> > > >
> > > > [brent@node1 mpi]$ srun -N 2 -n 4 env | egrep SLURM_NODEID\|SLURM_PROCID\|SLURM_LOCALID | sort
> > > > SLURM_LOCALID=0
> > > > SLURM_LOCALID=0
> > > > SLURM_LOCALID=1
> > > > SLURM_LOCALID=1
> > > > SLURM_NODEID=0
> > > > SLURM_NODEID=0
> > > > SLURM_NODEID=1
> > > > SLURM_NODEID=1
> > > > SLURM_PROCID=0
> > > > SLURM_PROCID=1
> > > > SLURM_PROCID=2
> > > > SLURM_PROCID=3
> > > > [brent@node1 mpi]$
> > > >
> > > > Since srun is not supported currently by OpenMPI, I have to use salloc - right? In this case, it is up to OpenMPI to interpret the SLURM environment variables it sees in the one process that is launched and 'do the right thing' - whatever that means in this case. How does OpenMPI start the processes on the remote nodes under the covers? (use srun, generate a hostfile and launch as you would outside SLURM, ...) This may be the difference between HP-MPI and OpenMPI.
> > > >
> > > > Thanks,
> > > >
> > > > Brent
> > > >
> > > > From: users-boun...@open-mpi.org [mailto:users-bounces@open-mpi.org] On Behalf Of Ralph Castain
> > > > Sent: Wednesday, February 23, 2011 10:07 AM
> > > > To: Open MPI Users
> > > > Subject: Re: [OMPI users] SLURM environment variables at runtime
> > > >
> > > > Resource managers generally frown on the idea of any program passing RM-managed envars from one node to another, and this is certainly true of slurm. The reason is that the RM reserves those values for its own use when managing remote nodes. For example, if you got an allocation and then used mpirun to launch a job across only a portion of that allocation, and then ran another mpirun instance in parallel on the remainder of the nodes, the slurm envars for those two mpirun instances -need- to be quite different. Having mpirun forward the values it sees would cause the system to become very confused.
> > > >
> > > > We learned the hard way never to cross that line :-(
> > > >
> > > > You have two options:
> > > >
> > > > (a) you could get your sys admin to configure slurm correctly to provide your desired envars on the remote nodes. This is the recommended (by slurm and other RMs) way of getting what you requested. It is a simple configuration option - if he needs help, he should contact the slurm mailing list
> > > >
> > > > (b) you can ask mpirun to do so, at your own risk. Specify each parameter with a "-x FOO" argument. See "man mpirun" for details. Keep an eye out for aberrant behavior.
> > > >
> > > > Ralph
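(For reference, option (b) looks like "mpirun -x FOO -np 4 ./a.out", with one -x argument per variable; FOO here is just a placeholder for whatever envar needs forwarding. Per Ralph's caveat above, manually forwarding SLURM_* values this way risks confusing the resource manager.)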
This works as I need if I use HP- > > MPI/PMPI, but when I use OpenMPI, it appears that not all are set as I > > would like across all of the ranks. > > > > > > > > I have example output below from a simple a.out that just writes > > out the environment that it sees to a file whose name is based on the > > node name and rank number. Note that with OpenMPI, that things like > > SLURM_NNODES and SLURM_TASKS_PER_NODE are not set the same for ranks on > > the different nodes and things like SLURM_LOCALID are just missing > > entirely. > > > > > > > > So the question is, should the environment variables on the remote > > nodes (from the perspective of where the job is launched) have the full > > set of SLURM environment variables as seen on the launching node? > > > > > > > > Thanks, > > > > > > > > Brent Henderson > > > > > > > > [brent@node2 mpi]$ rm node* > > > > [brent@node2 mpi]$ mkdir openmpi hpmpi > > > > [brent@node2 mpi]$ salloc -N 2 -n 4 mpirun ./printenv.openmpi > > > > salloc: Granted job allocation 23 > > > > Hello world! I'm 3 of 4 on node1 > > > > Hello world! I'm 2 of 4 on node1 > > > > Hello world! I'm 1 of 4 on node2 > > > > Hello world! I'm 0 of 4 on node2 > > > > salloc: Relinquishing job allocation 23 > > > > [brent@node2 mpi]$ mv node* openmpi/ > > > > [brent@node2 mpi]$ egrep > > 'NODEID|NNODES|LOCALID|NODELIST|NPROCS|PROCID|TASKS_PER' > > openmpi/node1.3.of.4 > > > > SLURM_JOB_NODELIST=node[1-2] > > > > SLURM_NNODES=1 > > > > SLURM_NODELIST=node[1-2] > > > > SLURM_TASKS_PER_NODE=1 > > > > SLURM_NPROCS=1 > > > > SLURM_STEP_NODELIST=node1 > > > > SLURM_STEP_TASKS_PER_NODE=1 > > > > SLURM_NODEID=0 > > > > SLURM_PROCID=0 > > > > SLURM_LOCALID=0 > > > > [brent@node2 mpi]$ egrep > > 'NODEID|NNODES|LOCALID|NODELIST|NPROCS|PROCID|TASKS_PER' > > openmpi/node2.1.of.4 > > > > SLURM_JOB_NODELIST=node[1-2] > > > > SLURM_NNODES=2 > > > > SLURM_NODELIST=node[1-2] > > > > SLURM_TASKS_PER_NODE=2(x2) > > > > SLURM_NPROCS=4 > > > > [brent@node2 mpi]$ > > > > > > > > > > > > [brent@node2 mpi]$ /opt/hpmpi/bin/mpirun -srun -N 2 -n 4 > > ./printenv.hpmpi > > > > Hello world! I'm 2 of 4 on node2 > > > > Hello world! I'm 3 of 4 on node2 > > > > Hello world! I'm 0 of 4 on node1 > > > > Hello world! 
> > > > Note that with OpenMPI, things like SLURM_NNODES and SLURM_TASKS_PER_NODE are not set the same for ranks on the different nodes, and things like SLURM_LOCALID are just missing entirely.
> > > >
> > > > So the question is, should the environment variables on the remote nodes (from the perspective of where the job is launched) have the full set of SLURM environment variables as seen on the launching node?
> > > >
> > > > Thanks,
> > > >
> > > > Brent Henderson
> > > >
> > > > [brent@node2 mpi]$ rm node*
> > > > [brent@node2 mpi]$ mkdir openmpi hpmpi
> > > > [brent@node2 mpi]$ salloc -N 2 -n 4 mpirun ./printenv.openmpi
> > > > salloc: Granted job allocation 23
> > > > Hello world! I'm 3 of 4 on node1
> > > > Hello world! I'm 2 of 4 on node1
> > > > Hello world! I'm 1 of 4 on node2
> > > > Hello world! I'm 0 of 4 on node2
> > > > salloc: Relinquishing job allocation 23
> > > > [brent@node2 mpi]$ mv node* openmpi/
> > > > [brent@node2 mpi]$ egrep 'NODEID|NNODES|LOCALID|NODELIST|NPROCS|PROCID|TASKS_PER' openmpi/node1.3.of.4
> > > > SLURM_JOB_NODELIST=node[1-2]
> > > > SLURM_NNODES=1
> > > > SLURM_NODELIST=node[1-2]
> > > > SLURM_TASKS_PER_NODE=1
> > > > SLURM_NPROCS=1
> > > > SLURM_STEP_NODELIST=node1
> > > > SLURM_STEP_TASKS_PER_NODE=1
> > > > SLURM_NODEID=0
> > > > SLURM_PROCID=0
> > > > SLURM_LOCALID=0
> > > > [brent@node2 mpi]$ egrep 'NODEID|NNODES|LOCALID|NODELIST|NPROCS|PROCID|TASKS_PER' openmpi/node2.1.of.4
> > > > SLURM_JOB_NODELIST=node[1-2]
> > > > SLURM_NNODES=2
> > > > SLURM_NODELIST=node[1-2]
> > > > SLURM_TASKS_PER_NODE=2(x2)
> > > > SLURM_NPROCS=4
> > > > [brent@node2 mpi]$
> > > >
> > > > [brent@node2 mpi]$ /opt/hpmpi/bin/mpirun -srun -N 2 -n 4 ./printenv.hpmpi
> > > > Hello world! I'm 2 of 4 on node2
> > > > Hello world! I'm 3 of 4 on node2
> > > > Hello world! I'm 0 of 4 on node1
> > > > Hello world! I'm 1 of 4 on node1
> > > > [brent@node2 mpi]$ mv node* hpmpi/
> > > > [brent@node2 mpi]$ egrep 'NODEID|NNODES|LOCALID|NODELIST|NPROCS|PROCID|TASKS_PER' hpmpi/node1.1.of.4
> > > > SLURM_NODELIST=node[1-2]
> > > > SLURM_TASKS_PER_NODE=2(x2)
> > > > SLURM_STEP_NODELIST=node[1-2]
> > > > SLURM_STEP_TASKS_PER_NODE=2(x2)
> > > > SLURM_NNODES=2
> > > > SLURM_NPROCS=4
> > > > SLURM_NODEID=0
> > > > SLURM_PROCID=1
> > > > SLURM_LOCALID=1
> > > > [brent@node2 mpi]$ egrep 'NODEID|NNODES|LOCALID|NODELIST|NPROCS|PROCID|TASKS_PER' hpmpi/node2.3.of.4
> > > > SLURM_NODELIST=node[1-2]
> > > > SLURM_TASKS_PER_NODE=2(x2)
> > > > SLURM_STEP_NODELIST=node[1-2]
> > > > SLURM_STEP_TASKS_PER_NODE=2(x2)
> > > > SLURM_NNODES=2
> > > > SLURM_NPROCS=4
> > > > SLURM_NODEID=1
> > > > SLURM_PROCID=3
> > > > SLURM_LOCALID=1
> > > > [brent@node2 mpi]$
> > >
> > > --
> > > Jeff Squyres
> > > jsquy...@cisco.com
> > > For corporate legal information go to:
> > > http://www.cisco.com/web/about/doing_business/legal/cri/
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users