Hi All,

I’m glad to see this come up.  We’ve used OpenMPI for a long time and switched 
to SLURM (from torque+moab) about 2.5 years ago.  At the time, I had a lot of 
questions about running MPI jobs under SLURM and good information seemed to be 
scarce - especially regarding “srun”.   I’ll just briefly share my/our 
observations.  For those who are interested, there are examples of our 
suggested submission scripts at 
https://help.rc.ufl.edu/doc/Sample_SLURM_Scripts#MPI_job (as I type this I’m 
hoping that page is up to date).  Feel free to comment or make suggestions if 
you have had different experiences or know better (very possible).

1. We initially ignored srun since mpiexec _seemed_ to work fine (more below).

2. We soon started to get user complaints of MPI apps running at 1/2 to 1/3 of 
their expected or previously observed speeds - but only sporadically.  The same 
job, submitted the same way, would sometimes run at full speed and sometimes at 
almost exactly 1/2 or 1/3 speed.

Investigation showed that some MPI ranks in the job were time-slicing across 
one or more of the cores allocated by Slurm.  It turns out that if the Slurm 
allocation is not consistent with the default OMPI core/socket mapping, this 
can easily happen.  It can be avoided by a) launching with “srun --mpi=pmi2” 
(or, as of the 2.x series, “srun --mpi=pmix”) or b) crafting your Slurm 
resource request more carefully so that it is consistent with the OMPI default 
core/socket mapping.
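To make option (a) concrete, here is a minimal sketch of a batch script that lets srun do the launch and binding via PMIx.  The module names, node/core counts, and application name are placeholders, not our exact production script:

```shell
#!/bin/bash
#SBATCH --job-name=mpi_test
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32   # match the actual cores per node on your cluster
#SBATCH --cpus-per-task=1
#SBATCH --time=01:00:00

module load gcc openmpi        # placeholder module names

# srun launches the ranks and hands OMPI its process map via PMIx,
# so the binding matches the Slurm allocation.  On older Slurm/OMPI
# combinations, use --mpi=pmi2 instead.
srun --mpi=pmix ./my_mpi_app   # ./my_mpi_app is a placeholder binary
```

Submitted with “sbatch”, this keeps the ranks on the cores Slurm actually allocated, rather than whatever OMPI’s default mapping would choose.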

So beware of resource requests that specify only the number of tasks 
(--ntasks=64) and then launch with “mpiexec”.  Slurm will happily allocate 
those tasks anywhere it can (on a busy cluster) and you will get some very 
non-optimal core mappings/bindings and, possibly, core sharing.
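As a quick sanity check for core sharing (a sketch, not our actual tooling), each rank can report which CPUs the kernel will let it run on.  On Linux this is exposed in /proc; launched under a well-bound job (e.g. “srun sh show_binding.sh”), each rank on a node should report a distinct, non-overlapping CPU list:

```shell
#!/bin/sh
# show_binding.sh (hypothetical name): print this process's CPU binding.
# Overlapping Cpus_allowed_list values across ranks on the same node
# indicate the time-slicing/core-sharing problem described above.
hostname
grep Cpus_allowed_list /proc/self/status
```

Run standalone it just shows the binding of your shell; the interesting output comes from running one copy per rank inside a job.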

3. While doing some spank development for a local, per-job (not per-step) 
temporary directory, I noticed that when launching multi-host MPI jobs with 
mpiexec (vs srun), you end up with more than one host with “slurm_nodeid=1”.  
I’m not sure whether this is a bug (it was 15.08.x), and it didn’t seem to 
cause issues, but I also don’t think it is ideal for two nodes in the same job 
to have the same numeric nodeid.  When launching with “srun”, that didn’t 
happen.
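For anyone who wants to reproduce the check, a sketch (requires an active Slurm allocation, so it can’t run outside a job):

```shell
# Print the node id each task sees, one task per node.  SLURM_NODEID is
# the standard Slurm per-task environment variable; with srun the ids
# should run 0..N-1 with no repeats.  In our 15.08.x tests, the same job
# launched via mpiexec showed duplicate ids.
srun --ntasks-per-node=1 sh -c 'echo "$(hostname) SLURM_NODEID=$SLURM_NODEID"'
```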

Anyway, that is what we have observed.  Generally speaking, I try to get users 
to use “srun” but many of them still use “mpiexec” out of habit.  You know what 
they say about old habits.  

Comments, suggestions, or just other experiences are welcome.  Also, if anyone 
is interested in the tmpdir spank plugin, you can contact me.  We are happy to 
share.

Best and Merry Christmas to all,

Charlie Taylor
UF Research Computing



> On Dec 18, 2017, at 8:12 PM, r...@open-mpi.org wrote:
> 
> We have had reports of applications running faster when executing under 
> OMPI’s mpiexec versus when started by srun. Reasons aren’t entirely clear, 
> but are likely related to differences in mapping/binding options (OMPI 
> provides a very large range compared to srun) and optimization flags provided 
> by mpiexec that are specific to OMPI.
> 
> OMPI uses PMIx for wireup support (starting with the v2.x series), which 
> provides a faster startup than other PMI implementations. However, that is 
> also available with Slurm starting with the 16.05 release, and some further 
> PMIx-based launch optimizations were recently added to the Slurm 17.11 
> release. So I would expect that launch via srun with the latest Slurm release 
> and PMIx would be faster than mpiexec - though that still leaves the faster 
> execution reports to consider.
> 
> HTH
> Ralph
> 
> 
>> On Dec 18, 2017, at 2:18 PM, Prentice Bisbal <pbis...@pppl.gov> wrote:
>> 
>> Greeting OpenMPI users and devs!
>> 
>> We use OpenMPI with Slurm as our scheduler, and a user has asked me this: 
>> should they use mpiexec/mpirun or srun to start their MPI jobs through Slurm?
>> 
>> My inclination is to use mpiexec, since that is the only method that's 
>> (somewhat) defined in the MPI standard and therefore the most portable, and 
>> the examples in the OpenMPI FAQ use mpirun. However, the Slurm documentation 
>> on the schedmd website say to use srun with the --mpi=pmi option. (See links 
>> below)
>> 
>> What are the pros/cons of using these two methods, other than the 
>> portability issue I already mentioned? Does srun+pmi use a different method 
>> to wire up the connections? Some things I read online seem to indicate that. 
>> If slurm was built with PMI support, and OpenMPI was built with Slurm 
>> support, does it really make any difference?
>> 
>> https://www.open-mpi.org/faq/?category=slurm
>> https://slurm.schedmd.com/mpi_guide.html#open_mpi
>> 
>> 
>> 
>> -- 
>> Prentice
>> 
>> _______________________________________________
>> users mailing list
>> users@lists.open-mpi.org
>> https://lists.open-mpi.org/mailman/listinfo/users
> 

