[slurm-dev] Re: MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode -1

2016-07-20 Thread Ralph Castain
It looks like the job timed out - I’m guessing that there is some kind of 
timeout spec being applied to the batch script that is not being applied to the 
interactive execution.
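
If that is the cause, a quick way to compare the two cases (just a sketch - the
values shown are placeholders) is to look at what time limit the batch job
actually received and, if needed, request one explicitly:

    scontrol show partition            # check DefaultTime/MaxTime on the partition
    scontrol show job <jobid> | grep -i timelimit

    # inside the batch script:
    #SBATCH --time=02:00:00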

> On Jul 20, 2016, at 12:19 PM, Kwok, Patrick  
> wrote:
> 
> Hi SLURM gurus,
>  
> I’m new to using slurm, so please excuse my lack of knowledge.
>  
> We’re trying to schedule an MPI program called Pegasos. According to Elekta, 
> who installed/configured Pegasos, it works with mpirun.  I created a shell 
> script using mpirun, and I am trying to run it on 2 nodes, using 20 CPUs each.
>  
> 1)  more test.mpirun.sh
> #!/bin/bash
> #SBATCH --ntasks-per-node=20
> #SBATCH -N 2
> mpirun -np 40 PegasosMPI test.sim
>  
> Running the bash script directly will finish normally.  Next, I try to submit 
> the job/script with sbatch
>  
> 2)  sbatch test.mpirun.sh 
>  
> It uses resources on the 2 nodes as expected, but Pegasos did not seem to run 
> as no output files were generated.  Here is the output for slurm-.out:
>  
> “srun: cluster configuration lacks support for cpu binding
> pegasos$Receive timeout, aborting simulation [PegasosMPI.cc 
> ,274]
>  
> --
> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
> with errorcode -1.
>  
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> --“
>  
> Can anyone help?
>  
> Thanks!
> Patrick



[slurm-dev] Re: Performance Degradation

2016-06-21 Thread Ralph Castain

I wouldn’t say mpirun is “screwing up” the placement of the ranks, but it will 
default to filling a node before starting to place procs on the next node. This 
is done to optimize performance as shared memory is faster than inter-node 
fabric. If you want the procs to instead “balance” across all available nodes, 
then you need to tell mpirun that’s what you want.

Check “mpirun -h” to find the right option to get the desired behavior.
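
For example, with the Open MPI releases in common use, something along these
lines should spread the ranks round-robin across the allocated nodes instead of
filling each node first (a sketch - check the mpirun man page on your installed
version for the exact spelling; the input file is the one from the quoted test
below):

    mpirun --map-by node -np 16 lmp_openmpi < in.vacf.2d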

> On Jun 21, 2016, at 6:49 AM, Peter Kjellström  wrote:
> 
> 
> On Sun, 19 Jun 2016 09:15:34 -0700
> Achi Hamza  wrote:
> 
>> Hi everyone,
>> 
>> I set up a lab consisting of 5 simple nodes (1 head and 4 compute nodes);
>> I used SLURM 15.08.11, OpenMPI 1.10.2, MPICH 3.2, FFTW 3.3.4 and
>> LAMMPS 16May2016.
> 
> Did you use OpenMPI _or_ MPICH?
> 
> Two observations on the below behavior:
> 
> * a 17 second runtime could indicate a very small problem that doesn't
>  scale past a few ranks.
> 
> * Your mpi launch could screw up the placement/pinning of ranks
> 
> /Peter K
> 
>> After successful installation of the above I conducted some tests
>> using the existing examples of LAMMPS. I got unrealistic results: the
>> execution time goes up as I increase the number of nodes!
>> 
>> mpirun -np 4 lmp_openmpi < in.vacf.2d
>> *Total wall time: 0:00:17*
>> 
>> mpirun -np 8 lmp_openmpi < in.vacf.2d
>> *Total wall time: 0:00:23*
>> 
>> mpirun -np 12 lmp_openmpi < in.vacf.2d
>> *Total wall time: 0:00:28*
>> 
>> mpirun -np 16 lmp_openmpi < in.vacf.2d
>> *Total wall time: 0:00:33*
>> 
>> 
>> interestingly, *srun* results are worse than mpirun:
>> 
>> srun --mpi=pmi2 -n 16 lmp_openmpi < in.vacf.2d
>> 
>> *Total wall time: 0:05:54*


[slurm-dev] Re: Processes sharing cores

2016-06-09 Thread Ralph Castain
Hi Jason

It sounds like the srun executed inside each mpirun is not getting bound to a 
specific set of cores, or else we are not correctly picking that up and staying 
within it. So let me see if I fully understand the scenario, and please forgive 
this old fossil brain if you’ve explained all this before:

You are executing multiple parallel sbatch commands on the same nodes, with 
each sbatch requesting and being allocated only a subset of cores on those 
nodes. Within each sbatch, you are executing a single mpirun that launches an 
application.

Is that accurate? If so, I can try to replicate and test this here if you tell 
me how you built and configured SLURM (as I haven’t used their task/affinity 
plugin before)

Ralph
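
(As a quick sanity check - purely a sketch, and assuming Linux compute nodes -
running something like the following inside each of the concurrent sbatch
allocations should show whether the job steps themselves get distinct core
masks:

    srun bash -c 'grep Cpus_allowed_list /proc/self/status'

If the two jobs report overlapping lists, the constraint is already lost on the
Slurm side; if the lists differ, then mpirun is not picking the binding up.)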

> On Jun 9, 2016, at 7:35 AM, Jason Bacon <bacon4...@gmail.com> wrote:
> 
> 
> 
> Thanks for all the suggestions, everyone.
> 
> A little more info:
> 
> I had to do a new OMPI build using --with-pmi.  Binding works correctly using 
> srun with this build, but mpirun still ignores the SLURM core assignments.
> 
> I also patched the task/affinity plugin for FreeBSD for the sake of 
> comparison (minor differences in the cpuset API).  It's not 100% yet, but it 
> appears that mpirun is ignoring the SLURM core assignments there as well.
> 
> Next question:
> 
> Is anyone out there seeing mpirun obey the core assignments from SLURM's 
> task/affinity plugin?  If so, I'd love to see your configure arguments for 
> both SLURM and OMPI.
> 
> I have growing doubts that this interface is working, though.  I can imagine 
> this issue going unnoticed most of the time, because it will only cause a 
> problem when an OMPI job shares a node with another job using core binding, 
> which is infrequent on our clusters.  Even when that happens, it may still go 
> unnoticed unless someone is monitoring performance carefully, because the 
> only likely impact is a few processes running at 50% their normal speed 
> because they're sharing a core.
> 
> I think this is worth fixing and I'd be happy to help with the coding and 
> testing.  We can't police how every user starts their MPI jobs, so it would 
> be good if it works properly no matter what they use.
> 
> Thanks again,
> 
>Jason
> 
> On 06/07/16 20:17, Ralph Castain wrote:
>> Yes, it should - provided the job step executing each mpirun has been given 
>> a unique binding. I suspect this is the problem you are encountering, but 
>> can’t know for certain. You could run an app that prints out its binding and 
>> then see if two parallel executions of srun yield different values.
>> 
>> 
>>> On Jun 7, 2016, at 5:26 PM, Jason Bacon <bacon4...@gmail.com> wrote:
>>> 
>>> 
>>> So this *should* work even for two separate MPI jobs sharing a node?
>>> 
>>> Thanks much,
>>> 
>>>Jason
>>> 
>>> On 06/07/2016 09:09, Ralph Castain wrote:
>>>> Yes, it should. What’s odd is that mpirun launches its daemons using srun 
>>>> under the covers, and the daemon should therefore be bound. We detect that 
>>>> and use it, but I’m not sure why this isn’t working here.
>>>> 
>>>> 
>>>>> On Jun 7, 2016, at 6:52 AM, Bruce Roberts <schedulerk...@gmail.com> wrote:
>>>>> 
>>>>> What happens if you use srun instead of mpirun? I would expect that to 
>>>>> work correctly.
>>>>> 
>>>>> On June 7, 2016 6:31:27 AM MST, Ralph Castain <r...@open-mpi.org> wrote:
>>>>> 
>>>>>No, we don’t pick that up - suppose we could try. Those envars
>>>>>have a history of changing, though, and it gets difficult to
>>>>>match the version with the var.
>>>>> 
>>>>>I can put this on my “nice to do someday” list and see if/when
>>>>>we can get to it. Just so I don’t have to parse around more -
>>>>>what version of slurm are you using?
>>>>> 
>>>>> 
>>>>>>On Jun 7, 2016, at 6:15 AM, Jason Bacon <bacon4...@gmail.com 
>>>>>> <mailto:bacon4...@gmail.com>>
>>>>>>wrote:
>>>>>> 
>>>>>> 

[slurm-dev] Re: Processes sharing cores

2016-06-07 Thread Ralph Castain
OMPI doesn’t use cgroups because we run at the user level, so we can’t set them 
on our child processes.
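
Slurm itself can impose the cgroup from the outside, though, since slurmd runs
with the needed privileges, and anything mpirun forks then inherits the cgroup
its step was placed in. A minimal sketch of the relevant settings (assuming
cgroup support is available on the compute nodes):

    # slurm.conf
    ProctrackType=proctrack/cgroup
    TaskPlugin=task/cgroup,task/affinity

    # cgroup.conf
    ConstrainCores=yes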


> On Jun 7, 2016, at 7:16 AM, Bruce Roberts <schedulerk...@gmail.com> wrote:
> 
> Not using cgroups? 
> 
> On June 7, 2016 7:10:19 AM PDT, Ralph Castain <r...@open-mpi.org> wrote:
> Yes, it should. What’s odd is that mpirun launches its daemons using srun 
> under the covers, and the daemon should therefore be bound. We detect that 
> and use it, but I’m not sure why this isn’t working here.
> 
> 
>> On Jun 7, 2016, at 6:52 AM, Bruce Roberts <schedulerk...@gmail.com 
>> <mailto:schedulerk...@gmail.com>> wrote:
>> 
>> What happens if you use srun instead of mpirun? I would expect that to work 
>> correctly. 
>> 
>> On June 7, 2016 6:31:27 AM MST, Ralph Castain <r...@open-mpi.org 
>> <mailto:r...@open-mpi.org>> wrote:
>> No, we don’t pick that up - suppose we could try. Those envars have a 
>> history of changing, though, and it gets difficult to match the version with 
>> the var.
>> 
>> I can put this on my “nice to do someday” list and see if/when we can get to 
>> it. Just so I don’t have to parse around more - what version of slurm are 
>> you using?
>> 
>> 
>>> On Jun 7, 2016, at 6:15 AM, Jason Bacon <bacon4...@gmail.com 
>>> <mailto:bacon4...@gmail.com>> wrote:
>>> 
>>> 
>>> 
>>> Thanks for the tip, but does OpenMPI not use SBATCH_CPU_BIND_* when SLURM 
>>> integration is compiled in?
>>> 
>>> printenv in the sbatch script produces the following:
>>> 
>>> Linux login.finch bacon ~/Data/Testing/Facil/Software/Src/Bench/MPI 379: 
>>> grep SBATCH slurm-5*
>>> slurm-579.out:SBATCH_CPU_BIND_LIST=0x3
>>> slurm-579.out:SBATCH_CPU_BIND_VERBOSE=verbose
>>> slurm-579.out:SBATCH_CPU_BIND_TYPE=mask_cpu:
>>> slurm-579.out:SBATCH_CPU_BIND=verbose,mask_cpu:0x3
>>> slurm-580.out:SBATCH_CPU_BIND_LIST=0xC
>>> slurm-580.out:SBATCH_CPU_BIND_VERBOSE=verbose
>>> slurm-580.out:SBATCH_CPU_BIND_TYPE=mask_cpu:
>>> slurm-580.out:SBATCH_CPU_BIND=verbose,mask_cpu:0xC
>>> 
>>> All OpenMPI jobs are using cores 0 and 2, although SLURM has assigned 0 and 
>>> 1 to job 579 and 2 and 3 to 580.
>>> 
>>> Regards,
>>> 
>>>Jason
>>> 
>>> On 06/06/16 21:11, Ralph Castain wrote:
>>>> Running two jobs across the same nodes is indeed an issue. Regardless of 
>>>> which MPI you use, the second mpiexec has no idea that the first one 
>>>> exists. Thus, the bindings applied to the second job will be computed as 
>>>> if the first job doesn’t exist - and thus, the procs will overload on top 
>>>> of each other.
>>>> 
>>>> The way you solve this with OpenMPI is by using the -slot-list <cores> 
>>>> option. This tells each mpiexec which cores are allocated to it, and it 
>>>> will constrain its binding calculation within that envelope. Thus, if you 
>>>> start the first job with -slot-list 0-2, and the second with -slot-list 
>>>> 3-5, the two jobs will be isolated from each other.
>>>> 
>>>> You can use any specification for the slot-list - it takes a 
>>>> comma-separated list of cores.
>>>> 
>>>> HTH
>>>> Ralph
>>>> 
>>>>> On Jun 6, 2016, at 6:08 PM, Jason Bacon <bacon4...@gmail.com> wrote:
>>>>> 
>>>>> 
>>>>> 
>>>>> Actually, --bind-to core is the default for most OpenMPI jobs now, so 
>>>>> adding this flag has no effect.  It refers to the processes within the 
>>>>> job.
>>>>> 
>>>>> I'm thinking this is an MPI-SLURM integration issue. Embarrassingly 
>>>>> parallel SLURM jobs are binding properly, but MPI jobs are ignoring the 
>>>>> SLURM environment and choosing their own cores.
>>>>> 
>>>>> OpenMPI was built with --with-slurm and it appears from config.log that 
>>>>> it located everything it needed.
>>>>> 
>>>>> I can work around the problem with "mpirun --bind-to none", which I'm 
>>>>> guessing will impact performance slightly for memory-intensive apps.
>>>>> 
>>>>> We're still digging on this one and may be for a while...
>>>>

[slurm-dev] Re: Processes sharing cores

2016-06-07 Thread Ralph Castain
Yes, it should. What’s odd is that mpirun launches its daemons using srun under 
the covers, and the daemon should therefore be bound. We detect that and use 
it, but I’m not sure why this isn’t working here.
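
(For the two-jobs-sharing-a-node case discussed further down this thread, the
-slot-list workaround would look roughly like the following sketch, where ./app
is a placeholder and the core ranges match whatever each job was actually
allocated:

    # job 1
    mpirun -slot-list 0-2 ./app
    # job 2, on the same node
    mpirun -slot-list 3-5 ./app
)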


> On Jun 7, 2016, at 6:52 AM, Bruce Roberts <schedulerk...@gmail.com> wrote:
> 
> What happens if you use srun instead of mpirun? I would expect that to work 
> correctly. 
> 
> On June 7, 2016 6:31:27 AM MST, Ralph Castain <r...@open-mpi.org> wrote:
> No, we don’t pick that up - suppose we could try. Those envars have a history 
> of changing, though, and it gets difficult to match the version with the var.
> 
> I can put this on my “nice to do someday” list and see if/when we can get to 
> it. Just so I don’t have to parse around more - what version of slurm are you 
> using?
> 
> 
>> On Jun 7, 2016, at 6:15 AM, Jason Bacon <bacon4...@gmail.com 
>> <mailto:bacon4...@gmail.com>> wrote:
>> 
>> 
>> 
>> Thanks for the tip, but does OpenMPI not use SBATCH_CPU_BIND_* when SLURM 
>> integration is compiled in?
>> 
>> printenv in the sbatch script produces the following:
>> 
>> Linux login.finch bacon ~/Data/Testing/Facil/Software/Src/Bench/MPI 379: 
>> grep SBATCH slurm-5*
>> slurm-579.out:SBATCH_CPU_BIND_LIST=0x3
>> slurm-579.out:SBATCH_CPU_BIND_VERBOSE=verbose
>> slurm-579.out:SBATCH_CPU_BIND_TYPE=mask_cpu:
>> slurm-579.out:SBATCH_CPU_BIND=verbose,mask_cpu:0x3
>> slurm-580.out:SBATCH_CPU_BIND_LIST=0xC
>> slurm-580.out:SBATCH_CPU_BIND_VERBOSE=verbose
>> slurm-580.out:SBATCH_CPU_BIND_TYPE=mask_cpu:
>> slurm-580.out:SBATCH_CPU_BIND=verbose,mask_cpu:0xC
>> 
>> All OpenMPI jobs are using cores 0 and 2, although SLURM has assigned 0 and 
>> 1 to job 579 and 2 and 3 to 580.
>> 
>> Regards,
>> 
>>Jason
>> 
>> On 06/06/16 21:11, Ralph Castain wrote:
>>> Running two jobs across the same nodes is indeed an issue. Regardless of 
>>> which MPI you use, the second mpiexec has no idea that the first one 
>>> exists. Thus, the bindings applied to the second job will be computed as if 
>>> the first job doesn’t exist - and thus, the procs will overload on top of 
>>> each other.
>>> 
>>> The way you solve this with OpenMPI is by using the -slot-list <cores> 
>>> option. This tells each mpiexec which cores are allocated to it, and it 
>>> will constrain its binding calculation within that envelope. Thus, if you 
>>> start the first job with -slot-list 0-2, and the second with -slot-list 
>>> 3-5, the two jobs will be isolated from each other.
>>> 
>>> You can use any specification for the slot-list - it takes a 
>>> comma-separated list of cores.
>>> 
>>> HTH
>>> Ralph
>>> 
>>>> On Jun 6, 2016, at 6:08 PM, Jason Bacon <bacon4...@gmail.com> wrote:
>>>> 
>>>> 
>>>> 
>>>> Actually, --bind-to core is the default for most OpenMPI jobs now, so 
>>>> adding this flag has no effect.  It refers to the processes within the job.
>>>> 
>>>> I'm thinking this is an MPI-SLURM integration issue. Embarrassingly 
>>>> parallel SLURM jobs are binding properly, but MPI jobs are ignoring the 
>>>> SLURM environment and choosing their own cores.
>>>> 
>>>> OpenMPI was built with --with-slurm and it appears from config.log that it 
>>>> located everything it needed.
>>>> 
>>>> I can work around the problem with "mpirun --bind-to none", which I'm 
>>>> guessing will impact performance slightly for memory-intensive apps.
>>>> 
>>>> We're still digging on this one and may be for a while...
>>>> 
>>>>   Jason
>>>> 
>>>> On 06/03/16 15:48, Benjamin Redling wrote:
>>>>> On 2016-06-03 21:25, Jason Bacon wrote:
>>>>>> It might be worth mentioning that the calcpi-parallel jobs are run with
>>>>>> --array (no srun).
>>>>>> 
>>>>>> Disabling the task/affinity plugin and using "mpirun --bind-to core"
>>>>>> works around the issue.  The MPI processes bind to specific cores and
>>>>>> the embarrassingly parallel jobs kindly move over and stay out of the 
>>>>>> way.
>>>>> Are the mpirun --bind-to core child processes the same as a slurm task?
>>>>> I have no exper

[slurm-dev] Re: Processes sharing cores

2016-06-07 Thread Ralph Castain
No, we don’t pick that up - suppose we could try. Those envars have a history 
of changing, though, and it gets difficult to match the version with the var.

I can put this on my “nice to do someday” list and see if/when we can get to 
it. Just so I don’t have to parse around more - what version of slurm are you 
using?


> On Jun 7, 2016, at 6:15 AM, Jason Bacon <bacon4...@gmail.com> wrote:
> 
> 
> 
> Thanks for the tip, but does OpenMPI not use SBATCH_CPU_BIND_* when SLURM 
> integration is compiled in?
> 
> printenv in the sbatch script produces the following:
> 
> Linux login.finch bacon ~/Data/Testing/Facil/Software/Src/Bench/MPI 379: grep 
> SBATCH slurm-5*
> slurm-579.out:SBATCH_CPU_BIND_LIST=0x3
> slurm-579.out:SBATCH_CPU_BIND_VERBOSE=verbose
> slurm-579.out:SBATCH_CPU_BIND_TYPE=mask_cpu:
> slurm-579.out:SBATCH_CPU_BIND=verbose,mask_cpu:0x3
> slurm-580.out:SBATCH_CPU_BIND_LIST=0xC
> slurm-580.out:SBATCH_CPU_BIND_VERBOSE=verbose
> slurm-580.out:SBATCH_CPU_BIND_TYPE=mask_cpu:
> slurm-580.out:SBATCH_CPU_BIND=verbose,mask_cpu:0xC
> 
> All OpenMPI jobs are using cores 0 and 2, although SLURM has assigned 0 and 1 
> to job 579 and 2 and 3 to 580.
> 
> Regards,
> 
>Jason
> 
> On 06/06/16 21:11, Ralph Castain wrote:
>> Running two jobs across the same nodes is indeed an issue. Regardless of 
>> which MPI you use, the second mpiexec has no idea that the first one exists. 
>> Thus, the bindings applied to the second job will be computed as if the 
>> first job doesn’t exist - and thus, the procs will overload on top of each 
>> other.
>> 
>> The way you solve this with OpenMPI is by using the -slot-list <cores> option. 
>> This tells each mpiexec which cores are allocated to it, and it will 
>> constrain its binding calculation within that envelope. Thus, if you start 
>> the first job with -slot-list 0-2, and the second with -slot-list 3-5, the 
>> two jobs will be isolated from each other.
>> 
>> You can use any specification for the slot-list - it takes a comma-separated 
>> list of cores.
>> 
>> HTH
>> Ralph
>> 
>>> On Jun 6, 2016, at 6:08 PM, Jason Bacon <bacon4...@gmail.com> wrote:
>>> 
>>> 
>>> 
>>> Actually, --bind-to core is the default for most OpenMPI jobs now, so 
>>> adding this flag has no effect.  It refers to the processes within the job.
>>> 
>>> I'm thinking this is an MPI-SLURM integration issue. Embarrassingly 
>>> parallel SLURM jobs are binding properly, but MPI jobs are ignoring the 
>>> SLURM environment and choosing their own cores.
>>> 
>>> OpenMPI was built with --with-slurm and it appears from config.log that it 
>>> located everything it needed.
>>> 
>>> I can work around the problem with "mpirun --bind-to none", which I'm 
>>> guessing will impact performance slightly for memory-intensive apps.
>>> 
>>> We're still digging on this one and may be for a while...
>>> 
>>>   Jason
>>> 
>>> On 06/03/16 15:48, Benjamin Redling wrote:
>>>> On 2016-06-03 21:25, Jason Bacon wrote:
>>>>> It might be worth mentioning that the calcpi-parallel jobs are run with
>>>>> --array (no srun).
>>>>> 
>>>>> Disabling the task/affinity plugin and using "mpirun --bind-to core"
>>>>> works around the issue.  The MPI processes bind to specific cores and
>>>>> the embarrassingly parallel jobs kindly move over and stay out of the way.
>>>> Are the mpirun --bind-to core child processes the same as a slurm task?
>>>> I have no experience at all with MPI jobs -- just trying to understand
>>>> task/affinity and params.
>>>> 
>>>> As far as I understand when you let mpirun do the binding it handles the
>>>> binding different https://www.open-mpi.org/doc/v1.8/man1/mpirun.1.php
>>>> 
>>>> If I grok the
>>>> % mpirun ... --map-by core --bind-to core
>>>> example in the "Mapping, Ranking, and Binding: Oh My!" section right.
>>>> 
>>>> 
>>>>> On 06/03/16 10:18, Jason Bacon wrote:
>>>>>> We're having an issue with CPU binding when two jobs land on the same
>>>>>> node.
>>>>>> 
>>>>>> Some cores are shared by the 2 jobs while others are left idle. Below
>>>> [...]
>>>>>> TaskPluginParam=cores,verbose
>>>> don't you bind each _job_ to a single core because you override
>>>> automatic binding and thus prevent binding each child process to a
>>>> different core?
>>>> 
>>>> 
>>>> Regards,
>>>> Benjamin
>>> 
>>> 
>>> --
>>> All wars are civil wars, because all men are brothers ... Each one owes
>>> infinitely more to the human race than to the particular country in
>>> which he was born.
>>>   -- Francois Fenelon
>> 
> 
> 
> -- 
> All wars are civil wars, because all men are brothers ... Each one owes
> infinitely more to the human race than to the particular country in
> which he was born.
>-- Francois Fenelon



[slurm-dev] Re: MPI/OpenMPI send receive not working

2016-04-30 Thread Ralph Castain
If you can identify the name of the adaptor (e.g., “eth0”), then you can either:

* include the one you want to use: -mca oob_tcp_if_include <if> -mca 
btl_tcp_if_include <if>

* exclude the Internet adaptor: -mca oob_tcp_if_exclude <if> -mca 
btl_tcp_if_exclude <if>

You cannot do both at the same time.
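
For example (a sketch - "eth1" and ./app are placeholders for your
cluster-internal adaptor and your binary):

    mpirun -mca oob_tcp_if_include eth1 -mca btl_tcp_if_include eth1 -np 4 ./app

or, equivalently, name the Internet-facing adaptor in the _exclude variants
instead.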

FWIW: it would help us to help you if you tell us up front that you are working 
with virtual machines as there are special issues when doing so :-/


> On Apr 30, 2016, at 12:51 PM, Mehdi Acheli <cm_ach...@esi.dz> wrote:
> 
> No, the original program didn't include a bug. It's failing due to the same 
> reason as the second. Since there is only one process in the world, when the 
> original program tries to mention another process with rank 1, it throws an 
> error. On the other hand, yes. It seems I have a problem on my SLURM/OMPI 
> integration. For the moment, I guess I'll just have to work with "salloc -> 
> mpirun" 
> Thankfully, I was able to locate the problem through "--mca plm_base_verbose 
> 10" option. I am running my cluster on virtual machines, each one having two 
> network adapters. One for the local access and the other connected to 
> Internet. I don't know why but OMPI tries to use the Internet network adapter 
> thus failing to establish communication. I had to remove the said adapter. Is 
> there a way to configure OMPI to avoid the problem ?
> 
> Thank you again for your interventions.
> 
> 
> 
> 2016-04-30 20:34 GMT+01:00 Ralph Castain <r...@open-mpi.org 
> <mailto:r...@open-mpi.org>>:
> As I said, your original program has a bug in it - you are using “rank” 
> values that are invalid. This is why it is failing when run under mpirun.
> 
> This second problem is caused by your SLURM integration to OMPI being broken, 
> probably due to not correctly linking the PMI support
> 
> 
>> On Apr 30, 2016, at 11:56 AM, Mehdi Acheli <cm_ach...@esi.dz 
>> <mailto:cm_ach...@esi.dz>> wrote:
>> 
>> Yes, if I use "salloc -N2 sh" and then launch the job via mpirun, the hello 
>> world program is doing well. However my original program is still blocking 
>> on the send and receive lines.
>> 
>> 2016-04-30 19:47 GMT+01:00 Ralph Castain <r...@open-mpi.org 
>> <mailto:r...@open-mpi.org>>:
>> Your slurm-OMPI integration is clearly broken - the processes do not realize 
>> they are operating in a common world. Does it work if you use mpirun instead 
>> of srun?
>> 
>> 
>>> On Apr 30, 2016, at 11:28 AM, Mehdi Acheli <cm_ach...@esi.dz 
>>> <mailto:cm_ach...@esi.dz>> wrote:
>>> 
>>> No, I just tested another program and it seems that the world_size is 
>>> reduced to one even though i launch the job on two nodes. The hello program 
>>> is doing the same. Well, I am completely lost now.
>>> 
>>> 
>>> 
>>> 
>>> 2016-04-30 19:09 GMT+01:00 Ralph Castain <r...@open-mpi.org 
>>> <mailto:r...@open-mpi.org>>:
>>> This looks like a bug in your program - you specified an invalid rank when 
>>> attempting to send.
>>> 
>>>> On Apr 30, 2016, at 10:59 AM, Mehdi Acheli <cm_ach...@esi.dz 
>>>> <mailto:cm_ach...@esi.dz>> wrote:
>>>> 
>>>> I just did. Permit me to include a capture of the script output file: 
>>>> 
>>>> 
>>>> 
>>>> I specify in my script the option "-N 2", but it looks like the world_size 
>>>> is composed of only one process and both nodes are trying to execute an 
>>>> MPI_Send !
>>>> 
>>>> 2016-04-30 18:41 GMT+01:00 Andy Riebs <andy.ri...@hpe.com 
>>>> <mailto:andy.ri...@hpe.com>>:
>>>> Aha! I missed it the first time... In your script, replace "mpirun" with 
>>>> "srun" and the world should be better.
>>>> 
>>>> On 04/30/2016 01:35 PM, Mehdi Acheli wrote:
>>>>> Euh, I did a "make all install" so I think pmi support is installed. And 
>>>>> the hello world program is working, would it if it wasn't installed ?
>>>>> 
>>>>> 2016-04-30 18:04 GMT+01:00 Andy Riebs <andy.ri...@hpe.com 
>>>>> <mailto:andy.ri...@hpe.com>>:
>>>>> For Slurm, after the "make install", did you do a "make install-contrib" 
>>>>> (which builds the pmi2 support)? I think you would have seen a runtime 
>>>>> error if you hadn't, but possibly not.
>>>>> 
>>>>> On 04/30/2016 12:14 PM, Mehdi Acheli wrote:
>>

[slurm-dev] Re: MPI/OpenMPI send receive not working

2016-04-30 Thread Ralph Castain
As I said, your original program has a bug in it - you are using “rank” values 
that are invalid. This is why it is failing when run under mpirun.

This second problem is caused by your SLURM integration to OMPI being broken, 
probably due to not correctly linking the PMI support.
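
A typical way to get that linkage right (a sketch - the paths are assumptions,
so point them at wherever your Slurm headers and PMI libraries actually live)
is to rebuild Open MPI against Slurm's PMI and then check the result:

    ./configure --with-slurm --with-pmi=/usr
    make && make install

    ompi_info | grep -i pmi    # should list PMI components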


> On Apr 30, 2016, at 11:56 AM, Mehdi Acheli <cm_ach...@esi.dz> wrote:
> 
> Yes, if I use "salloc -N2 sh" and then launch the job via mpirun, the hello 
> world program is doing well. However my original program is still blocking on 
> the send and receive lines.
> 
> 2016-04-30 19:47 GMT+01:00 Ralph Castain <r...@open-mpi.org 
> <mailto:r...@open-mpi.org>>:
> Your slurm-OMPI integration is clearly broken - the processes do not realize 
> they are operating in a common world. Does it work if you use mpirun instead 
> of srun?
> 
> 
>> On Apr 30, 2016, at 11:28 AM, Mehdi Acheli <cm_ach...@esi.dz 
>> <mailto:cm_ach...@esi.dz>> wrote:
>> 
>> No, I just tested another program and it seems that the world_size is 
>> reduced to one even though i launch the job on two nodes. The hello program 
>> is doing the same. Well, I am completely lost now.
>> 
>> 
>> 
>> 
>> 2016-04-30 19:09 GMT+01:00 Ralph Castain <r...@open-mpi.org 
>> <mailto:r...@open-mpi.org>>:
>> This looks like a bug in your program - you specified an invalid rank when 
>> attempting to send.
>> 
>>> On Apr 30, 2016, at 10:59 AM, Mehdi Acheli <cm_ach...@esi.dz 
>>> <mailto:cm_ach...@esi.dz>> wrote:
>>> 
>>> I just did. Permit me to include a capture of the script output file: 
>>> 
>>> 
>>> 
>>> I specify in my script the option "-N 2", but it looks like the world_size 
>>> is composed of only one process and both nodes are trying to execute an 
>>> MPI_Send !
>>> 
>>> 2016-04-30 18:41 GMT+01:00 Andy Riebs <andy.ri...@hpe.com 
>>> <mailto:andy.ri...@hpe.com>>:
>>> Aha! I missed it the first time... In your script, replace "mpirun" with 
>>> "srun" and the world should be better.
>>> 
>>> On 04/30/2016 01:35 PM, Mehdi Acheli wrote:
>>>> Euh, I did a "make all install" so I think pmi support is installed. And 
>>>> the hello world program is working, would it if it wasn't installed ?
>>>> 
>>>> 2016-04-30 18:04 GMT+01:00 Andy Riebs <andy.ri...@hpe.com 
>>>> <mailto:andy.ri...@hpe.com>>:
>>>> For Slurm, after the "make install", did you do a "make install-contrib" 
>>>> (which builds the pmi2 support)? I think you would have seen a runtime 
>>>> error if you hadn't, but possibly not.
>>>> 
>>>> On 04/30/2016 12:14 PM, Mehdi Acheli wrote:
>>>>> First of all, thank you for the reaction.
>>>>> 
>>>>> Here are the answers :
>>>>> I tried multiple commands:
>>>>> I started with "srun -N2 --mpi=pmi2 ptest" then I changed the 
>>>>> slurm.conf's mpi parameter to pmi2 so I no longer need the option.
>>>>> I also tried a script submitted via sbatch. It doesn't work either and 
>>>>> squeue shows that it's running. My program is just passing a number from 
>>>>> node 1 to node 2 so it doesn't normally take that long.
>>>>> OpenMPI version is 1.10.2 / SLURM's is 15.08.8
>>>>> I built Slurm myself with no specific options. For OpenMPI I actually 
>>>>> downloaded it from the CentOS 7 default repo. But I tried building the 
>>>>> same version before with --with-slurm and --with-pmi options, yet it 
>>>>> wasn't working either.
>>>>> I am joining a copy of my slurm.conf file and the script I used to submit 
>>>>> the job.
>>>>> 
>>>>> The script:
>>>>> 
>>>>> #!/bin/bash
>>>>> #
>>>>> #SBATCH --job-name=test
>>>>> #SBATCH --output=res_mpi.txt
>>>>> #
>>>>> #SBATCH -N 2
>>>>> module load openmpi
>>>>> mpirun test
>>>>> 
>>>>> Slurm.conf file :
>>>>> 
>>>>> 
>>>>> # slurm.conf file generated by configurator easy.html.
>>>>> # Put this file on all nodes of your cluster.
>>>>> # See the slurm.conf man page for more information.
>>>>> #
>>>>> ControlMachine=m
>>>>> ControlAddr=m
>&g

[slurm-dev] Re: MPI/OpenMPI send receive not working

2016-04-30 Thread Ralph Castain
Your slurm-OMPI integration is clearly broken - the processes do not realize 
they are operating in a common world. Does it work if you use mpirun instead of 
srun?


> On Apr 30, 2016, at 11:28 AM, Mehdi Acheli <cm_ach...@esi.dz 
> <mailto:cm_ach...@esi.dz>> wrote:
> 
> No, I just tested another program and it seems that the world_size is reduced 
> to one even though i launch the job on two nodes. The hello program is doing 
> the same. Well, I am completely lost now.
> 
> 
> 
> 
> 2016-04-30 19:09 GMT+01:00 Ralph Castain <r...@open-mpi.org 
> <mailto:r...@open-mpi.org>>:
> This looks like a bug in your program - you specified an invalid rank when 
> attempting to send.
> 
>> On Apr 30, 2016, at 10:59 AM, Mehdi Acheli <cm_ach...@esi.dz 
>> <mailto:cm_ach...@esi.dz>> wrote:
>> 
>> I just did. Permit me to include a capture of the script output file: 
>> 
>> 
>> 
>> I specify in my script the option "-N 2", but it looks like the world_size 
>> is composed of only one process and both nodes are trying to execute an 
>> MPI_Send !
>> 
>> 2016-04-30 18:41 GMT+01:00 Andy Riebs <andy.ri...@hpe.com 
>> <mailto:andy.ri...@hpe.com>>:
>> Aha! I missed it the first time... In your script, replace "mpirun" with 
>> "srun" and the world should be better.
>> 
>> On 04/30/2016 01:35 PM, Mehdi Acheli wrote:
>>> Euh, I did a "make all install" so I think pmi support is installed. And 
>>> the hello world program is working, would it if it wasn't installed ?
>>> 
>>> 2016-04-30 18:04 GMT+01:00 Andy Riebs <andy.ri...@hpe.com 
>>> <mailto:andy.ri...@hpe.com>>:
>>> For Slurm, after the "make install", did you do a "make install-contrib" 
>>> (which builds the pmi2 support)? I think you would have seen a runtime 
>>> error if you hadn't, but possibly not.
>>> 
>>> On 04/30/2016 12:14 PM, Mehdi Acheli wrote:
>>>> First of all, thank you for the reaction.
>>>> 
>>>> Here are the answers :
>>>> I tried multiple commands:
>>>> I started with "srun -N2 --mpi=pmi2 ptest" then I changed the slurm.conf's 
>>>> mpi parameter to pmi2 so I no longer need the option.
>>>> I also tried a script submitted via sbatch. It doesn't work either and 
>>>> squeue shows that it's running. My program is just passing a number from 
>>>> node 1 to node 2 so it doesn't normally take that long.
>>>> OpenMPI version is 1.10.2 / SLURM's is 15.08.8
>>>> I built Slurm myself with no specific options. For OpenMPI I actually 
>>>> downloaded it from the CentOS 7 default repo. But I tried building the 
>>>> same version before with --with-slurm and --with-pmi options, yet it 
>>>> wasn't working either.
>>>> I am joining a copy of my slurm.conf file and the script I used to submit 
>>>> the job.
>>>> 
>>>> The script:
>>>> 
>>>> #!/bin/bash
>>>> #
>>>> #SBATCH --job-name=test
>>>> #SBATCH --output=res_mpi.txt
>>>> #
>>>> #SBATCH -N 2
>>>> module load openmpi
>>>> mpirun test
>>>> 
>>>> Slurm.conf file :
>>>> 
>>>> 
>>>> # slurm.conf file generated by configurator easy.html.
>>>> # Put this file on all nodes of your cluster.
>>>> # See the slurm.conf man page for more information.
>>>> #
>>>> ControlMachine=m
>>>> ControlAddr=m
>>>> BackupController=mb
>>>> BackupAddr=mb
>>>> #
>>>> #MailProg=/bin/mail
>>>> MpiDefault=pmi2
>>>> MpiParams=ports=12000-12999
>>>> ProctrackType=proctrack/linuxproc
>>>> ReturnToService=2
>>>> #SlurmctldPidFile=/var/run/slurmctld.pid
>>>> #SlurmctldPort=6817
>>>> #SlurmdPidFile=/var/run/slurmd.pid
>>>> #SlurmdPort=6818
>>>> SlurmdSpoolDir=/var/spool/slurm/slurmd
>>>> SlurmUser=slurm
>>>> #SlurmdUser=root
>>>> #StateSaveLocation=/var/spool/slurm
>>>> StateSaveLocation=/mnt/data/spool/slurm
>>>> SwitchType=switch/none
>>>> TaskPlugin=task/none
>>>> #
>>>> #
>>>> # TIMERS
>>>> #KillWait=30
>>>> #MinJobAge=300
>>>> #SlurmctldTimeout=120
>>>> #SlurmdTimeout=300
>>>> #
>>>> #
>>>> # SC

[slurm-dev] Re: MPI/OpenMPI send receive not working

2016-04-30 Thread Ralph Castain
This looks like a bug in your program - you specified an invalid rank when 
attempting to send.

> On Apr 30, 2016, at 10:59 AM, Mehdi Acheli  wrote:
> 
> I just did. Permit me to include a capture of the script output file: 
> 
> 
> 
> I specify in my script the option "-N 2", but it looks like the world_size is 
> composed of only one process and both nodes are trying to execute an MPI_Send 
> !
> 
> 2016-04-30 18:41 GMT+01:00 Andy Riebs  >:
> Aha! I missed it the first time... In your script, replace "mpirun" with 
> "srun" and the world should be better.
> 
> On 04/30/2016 01:35 PM, Mehdi Acheli wrote:
>> Euh, I did a "make all install" so I think pmi support is installed. And the 
>> hello world program is working, would it if it wasn't installed ?
>> 
>> 2016-04-30 18:04 GMT+01:00 Andy Riebs > >:
>> For Slurm, after the "make install", did you do a "make install-contrib" 
>> (which builds the pmi2 support)? I think you would have seen a runtime error 
>> if you hadn't, but possibly not.
>> 
>> On 04/30/2016 12:14 PM, Mehdi Acheli wrote:
>>> First of all, thank you for the reaction.
>>> 
>>> Here are the answers :
>>> I tried multiple commands:
>>> I started with "srun -N2 --mpi=pmi2 ptest" then I changed the slurm.conf's 
>>> mpi parameter to pmi2 so I no longer need the option.
>>> I also tried a script submitted via sbatch. It doesn't work either and 
>>> squeue shows that it's running. My program is just passing a number from 
>>> node 1 to node 2 so it doesn't normally take that long.
>>> OpenMPI version is 1.10.2 / SLURM's is 15.08.8
>>> I built Slurm myself with no specific options. For OpenMPI I actually 
>>> downloaded it from the CentOS 7 default repo. But I tried building the same 
>>> version before with --with-slurm and --with-pmi options, yet it wasn't 
>>> working either.
>>> I am joining a copy of my slurm.conf file and the script I used to submit 
>>> the job.
>>> 
>>> The script:
>>> 
>>> #!/bin/bash
>>> #
>>> #SBATCH --job-name=test
>>> #SBATCH --output=res_mpi.txt
>>> #
>>> #SBATCH -N 2
>>> module load openmpi
>>> mpirun test
>>> 
>>> Slurm.conf file :
>>> 
>>> 
>>> # slurm.conf file generated by configurator easy.html.
>>> # Put this file on all nodes of your cluster.
>>> # See the slurm.conf man page for more information.
>>> #
>>> ControlMachine=m
>>> ControlAddr=m
>>> BackupController=mb
>>> BackupAddr=mb
>>> #
>>> #MailProg=/bin/mail
>>> MpiDefault=pmi2
>>> MpiParams=ports=12000-12999
>>> ProctrackType=proctrack/linuxproc
>>> ReturnToService=2
>>> #SlurmctldPidFile=/var/run/slurmctld.pid
>>> #SlurmctldPort=6817
>>> #SlurmdPidFile=/var/run/slurmd.pid
>>> #SlurmdPort=6818
>>> SlurmdSpoolDir=/var/spool/slurm/slurmd
>>> SlurmUser=slurm
>>> #SlurmdUser=root
>>> #StateSaveLocation=/var/spool/slurm
>>> StateSaveLocation=/mnt/data/spool/slurm
>>> SwitchType=switch/none
>>> TaskPlugin=task/none
>>> #
>>> #
>>> # TIMERS
>>> #KillWait=30
>>> #MinJobAge=300
>>> #SlurmctldTimeout=120
>>> #SlurmdTimeout=300
>>> #
>>> #
>>> # SCHEDULING
>>> FastSchedule=1
>>> SchedulerType=sched/backfill
>>> #SchedulerPort=7321
>>> SelectType=select/linear
>>> PreemptType=preempt/partition_prio
>>> PreemptMode=requeue
>>> #
>>> #
>>> # LOGGING AND ACCOUNTING
>>> AccountingStorageType=accounting_storage/slurmdbd
>>> #JobAcctGatherFrequency=30
>>> JobAcctGatherType=jobacct_gather/linux
>>> JobCompType=jobcomp/none
>>> #SlurmctldDebug=3
>>> #SlurmctldLogFile=/var/log/slurmctld.log
>>> SlurmctldLogFile=/mnt/data/log/slurmctld.log
>>> #SlurmdDebug=3
>>> SlurmdLogFile=/var/log/slurmd.log
>>> AccountingStorageBackupHost=mb
>>> #
>>> #
>>> # COMPUTE NODES
>>> NodeName=n[1-8] NodeAddr=n[1-8] CPUs=1 State=UNKNOWN
>>> NodeName=logn NodeAddr=logn CPUs=1 State=UNKNOWN
>>> 
>>> 2016-04-30 16:40 GMT+01:00 Andy Riebs < 
>>> andy.ri...@hpe.com >:
>>> Hi,
>>> 
>>> The one problem that I see in your description is minor, and probably not 
>>> significant: the MPI ports parameter was needed for very old versions of 
>>> Open MPI, IIRC.
>>> 
>>> To help debug your problems, please respond to this list with
>>> What command did you use to invoke your program?
>>> What versions of Slurm and OpenMPI are you using?
>>> Did you build them yourself, or use prebuilt versions?
>>> If you built them yourself, what configuration options did you use?
>>> If pre-built versions, where did you get them?
>>> A copy of your slurm.conf file (you may want to change node names and other 
>>> potentially sensitive information)
>>> Andy
>>> 
>>> On 04/30/2016 10:02 AM, Mehdi Acheli wrote:
 Hello everyone,
 
 I've set a basic configuration using slurm with a 
 master node, backup node, a login node and eight compute nodes.
 
 Everything in slurm is working fine. I can issue 
 jobs and see the state 

[slurm-dev] Re: checkpoint/restart feature in SLURM

2016-03-20 Thread Ralph Castain
I am not aware of any MPI that would allow you to relocate a process while the 
job is running. You have to checkpoint it, terminate it, and then restart the 
entire job with the new node included.
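
The rough flow, assuming a checkpoint plugin such as checkpoint/blcr is
configured (this is only a sketch of the idea - check the scontrol man page for
the exact checkpoint operations supported by your Slurm version):

    scontrol checkpoint create <jobid>      # write a checkpoint
    scontrol checkpoint vacate <jobid>      # checkpoint and terminate the job
    sbatch --exclude=nodeA restart_job.sh   # restart on other nodes

where restart_job.sh is a hypothetical script that relaunches the application
from its checkpoint files.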

> On Mar 16, 2016, at 9:58 PM, Husen R  wrote:
> 
> Dear Slurm-dev,
> 
> 
> Does checkpoint/restart feature available in SLURM able to relocate MPI 
> application from one node to another node while it is running ?
> 
> For the example, I run MPI application in node A,B and C in a cluster and I 
> want to migrate/relocate process running in node A to other node, let's say 
> to node C while it is running. 
> 
> is there a way to do this with SLURM ? Thank you.
> 
> 
> Regards,
> 
> Husen



[slurm-dev] Re: slurm-dev SLURM PMI2 performance vs. mpirun/mpiexec (was: Re: Re: more detailed installation guide)

2016-01-07 Thread Ralph Castain
Just following up as promised with some data. The graphs below were generated 
using the SLURM master with the PMIx plugin based on PMIx v1.1.0, running 64 
procs/node, using a simple MPI_Init/MPI_Finalize app. The blue line used srun 
to start the job, and used PMI-2. The red line also was started by srun, but 
used PMIx. As you can see, there is some performance benefit from use of PMIx.

The gray line used srun to start the job and the PMIx plugin, but also used the 
new optional features to reduce the startup time. There are two features:

(a) we only do a modex “recv” (i.e., a PMI-get) upon first communication to a 
specific peer

(b) the modex itself (i.e., pmi_fence) operation simply drops thru - we do not 
execute a barrier. Instead, there is an async exchange of the data. We only 
block when the proc requests a specific piece of data


The final yellow line is mpirun (which uses PMIx) using the new optional 
features. As you can see, it’s a little faster than srun-based launch.

We are extending these tests to larger scale, and continuing to push the 
performance as discussed before.

HTH
Ralph





> On Jan 6, 2016, at 11:58 PM, Ralph Castain <r...@open-mpi.org> wrote:
> 
> 
> 
>> On Jan 6, 2016, at 9:31 PM, Novosielski, Ryan <novos...@ca.rutgers.edu> 
>> wrote:
>> 
>> 
>>> On Jan 6, 2016, at 23:31, Christopher Samuel <sam...@unimelb.edu.au> wrote:
>>> 
>>>> On 07/01/16 01:03, Novosielski, Ryan wrote:
>>>> 
>>>> Since this is an audience that might know, and this is related (but
>>>> off-topic, sorry): is there any truth to the suggestions on the Internet
>>>> that using srun is /slower/ than mpirun/mpiexec?
>>> 
>>> In our experience Open-MPI 1.6.x and earlier (PMI-1 support) is slower
>>> with srun than with mpirun.  This was tested with NAMD.
>>> 
>>> Open-MPI 1.8.x and later with PMI-2 is about the same with srun as with
>>> mpirun.
>> 
>> Thanks very much to both of you who have responded with an answer to this 
>> question. Both of you have said "about the same" if I'm not mistaken. So I 
>> guess there still is a very slight performance penalty to using PMI2 
>> instead of mpirun? Probably worth it anyway, but I'm just curious to know 
>> the real score. Not a lot of info about this other than the mailing list.
> 
> FWIW: the reason the gap closed when going from the (1.6 vs srun+PMI1) to the 
> (1.8 vs srun+PMI2) scenario is partly because of the PMI-1 vs PMI-2 
> difference, but also because OMPI’s mpirun slowed down significantly between 
> the 1.6 and 1.8 series. We didn’t catch the loss of performance in time, but 
> are addressing it for the upcoming 2.0 series.
> 
> In 2.0, mpirun will natively use PMIx, and you can additionally use two new 
> optional features to dramatically improve the launch time. I’ll provide a 
> graph tomorrow to show the different performance vs PMI-2 even at small 
> scale. Those features may become the default behavior at some point - hasn’t 
> fully been decided yet as they need time to mature.
> 
> However, the situation is fluid. Using the SLURM PMix plugin (in master now 
> and tentatively scheduled for release later this year) will effectively close 
> the gap. Somewhere in that same timeframe, OMPI will be implementing further 
> improvements to mpirun (using fabric instead of mgmt Ethernet to perform 
> barriers, distributing the launch mapping procedure, etc.) and will likely 
> move ahead again - and then members of the PMIx community are already 
> planning to propose some of those changes for SLURM. If accepted, you’ll see 
> the gap close again.
> 
> So I expect to see this surge and recover pattern to continue for the next 
> couple of years, with mpirun ahead for awhile and then even with SLURM when 
> using the PMIx plugin.
> 
> HTH - and I’ll provide the graph in the morning.
> Ralph
> 
> 
> 
>> 
>> Thanks again.



[slurm-dev] Re: more detailed installation guide

2016-01-06 Thread Ralph Castain
As with all such rumors, there is some truth and some inaccuracies to it. Note 
that the various MPIs have historically differed significantly in how they 
implement mpirun, though the differences in terms of behavior and performance 
have been closing. So it is hard to provide a clearcut answer that spans time, 
and I’ll just report where we are now and looking ahead a bit.

PMI-1 support doesn't scale as well as what was done in mpirun from some of the 
MPI libraries, and so your (A) is certainly true. Remember that Slurm provides 
PMI-1 out-of-the-box and that you have to do a second build step to add PMI-2 
support. So for people that just do the std install and run, this will be the 
expected situation.

For those that install PMI-2 (or the new extended PMI-2 for MVAPICH), you’ll 
see some improved performance. I suspect you’ll find that srun and mpirun are 
pretty close to each other at that point, and the choice really just comes down 
to your desired cmd line options.

The test results with PMIx indicate that the performance gap between direct 
(srun) launch and indirect (mpirun) launch is pretty much gone. You have to 
remember that the overhead of mapping the job isn’t very large (and the time is 
roughly equal anyway), and that both srun and mpirun distribute the launch cmd 
in the same way (via a tree-based algorithm). Likewise, both involve starting a 
user-level daemon and wiring those up.

So when you break down the steps, and given that mpirun and srun are using the 
same wireup support, you can see that the two should be equivalent. Really just 
a question of which cmd line options you prefer.

HTH
Ralph


> On Jan 6, 2016, at 6:03 AM, Novosielski, Ryan <novos...@ca.rutgers.edu 
> <mailto:novos...@ca.rutgers.edu>> wrote:
> 
> Since this is an audience that might know, and this is related (but 
> off-topic, sorry): is there any truth to the suggestions on the Internet that 
> using srun is /slower/ than mpirun/mpiexec? There were some old mailing list 
> messages someplace that seem to indicate A) yes, in the old days of PMI1 only 
> or B) likely it was a misconfigured system in the first place. I haven't 
> found anything definitive though and those threads sort of petered out 
> without an answer. 
> 
>  *Note: UMDNJ is now Rutgers-Biomedical and Health Sciences*
> || \\UTGERS   |-*O*-
> ||_// Biomedical | Ryan Novosielski - Senior Technologist
> || \\ and Health | novos...@rutgers.edu <mailto:novos...@rutgers.edu>- 
> 973/972.0922 (2x0922)
> ||  \\  Sciences | OIRT/High Perf & Res Comp - MSB C630, Newark
> `'
> 
> On Jan 6, 2016, at 01:43, Ralph Castain <r...@open-mpi.org 
> <mailto:r...@open-mpi.org>> wrote:
> 
>> 
>> Simple reason, Chris - the PMI support is GPL 2.0, and so anything built 
>> against it automatically becomes GPL. So OpenHPC cannot distribute Slurm 
>> with those libraries.
>> 
>> Instead, we are looking to use the new PMIx library to provide wireup 
>> support, which includes backward support for PMI 1 and 2. I’m supposed to 
>> complete that backport in my copious free time :-)
>> 
>> Until then, you can only launch via mpirun - which is just as fast, 
>> actually, but does indeed have different cmd line options.
>> 
>> 
>>> On Jan 5, 2016, at 9:22 PM, Christopher Samuel <sam...@unimelb.edu.au 
>>> <mailto:sam...@unimelb.edu.au>> wrote:
>>> 
>>> 
>>> On 06/01/16 01:46, David Carlet wrote:
>>> 
>>>> Depending on where you are in the design/development phase for your
>>>> project, you might also consider switching to using the OpenHPC build.
>>> 
>>> Caution: for reasons that are unclear OpenHPC disables Slurm PMI support:
>>> 
>>> https://github.com/openhpc/ohpc/releases/download/v1.0.GA/Install_guide-CentOS7.1-1.0.pdf
>>>  
>>> <https://github.com/openhpc/ohpc/releases/download/v1.0.GA/Install_guide-CentOS7.1-1.0.pdf>
>>> 
>>> # At present, OpenHPC is unable to include the PMI process
>>> # management server normally included within Slurm which
>>> # implies that srun cannot be use for MPI job launch. Instead,
>>> # native job launch mechanisms provided by the MPI stacks are
>>> # utilized and prun abstracts this process for the various
>>> # stacks to retain a single launch command.
>>> 
>>> Their spec file does:
>>> 
>>> # 6/16/15 karl.w.sch...@intel.com <mailto:karl.w.sch...@intel.com> - do not 
>>> package Slurm's version of libpmi with OpenHPC.
>>> %if 0%{?OHPC_BUILD}
>>>  rm -f $RPM_BUILD_ROOT/%{_libdir}/libpmi*
>>>  rm -f $RPM_BUILD_ROOT/%{_libdir}/mpi_pmi2*
>>> %endif
>>> 
>>> 
>>> 
>>> -- 
>>> Christopher SamuelSenior Systems Administrator
>>> VLSCI - Victorian Life Sciences Computation Initiative
>>> Email: sam...@unimelb.edu.au <mailto:sam...@unimelb.edu.au> Phone: +61 (0)3 
>>> 903 55545
>>> http://www.vlsci.org.au/ <http://www.vlsci.org.au/>  
>>> http://twitter.com/vlsci <http://twitter.com/vlsci>



[slurm-dev] Re: more detailed installation guide

2016-01-06 Thread Ralph Castain


> On Jan 6, 2016, at 6:33 PM, Christopher Samuel <sam...@unimelb.edu.au> wrote:
> 
> 
> On 06/01/16 17:14, Ralph Castain wrote:
> 
>> Simple reason, Chris - the PMI support is GPL 2.0, and so anything
>> built against it automatically becomes GPL.
> 
> My understanding is that's only the case if the only implementation(s)
> is/are under the GPL.  If there is a BSD implementation (for instance)
> then other code that uses it is not a derivative work of a GPL application.
> 
> For example that's apparently why the fgmp library was created as a BSD
> version of the GNU GMP library:
> 
> https://lwn.net/Articles/548576/
> 
> 

That is true, Chris - but the SLURM PMI implementations are all GPL 2.0. The 
PMIx implementation is BSD

> All the best,
> Chris
> -- 
> Christopher SamuelSenior Systems Administrator
> VLSCI - Victorian Life Sciences Computation Initiative
> Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
> http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Re: more detailed installation guide

2016-01-06 Thread Ralph Castain


> On Jan 6, 2016, at 8:38 PM, Christopher Samuel <sam...@unimelb.edu.au> wrote:
> 
> 
> On 07/01/16 14:09, Ralph Castain wrote:
> 
>> That is true, Chris - but the SLURM PMI implementations are all GPL 2.0.
> 
> But Slurm is only one implementation of PMI, the MPICH FAQ says:
> 
> https://wiki.mpich.org/mpich/index.php/Frequently_Asked_Questions#Q:_What_are_process_managers.3F
> 
> # These process managers communicate with MPICH processes using
> # a predefined interface called as PMI (process management interface).
> # Since the interface is (informally) standardized within MPICH and
> # its derivatives, you can use any process manager from MPICH or its
> # derivatives with any MPI application built with MPICH or any of its
> # derivatives, as long as they follow the same wire protocol. There
> # are three known implementations of the PMI wire protocol: "simple",
> # "smpd" and "slurm". By default, MPICH and all its derivatives use
> # the "simple" PMI wire protocol, but MPICH can be configured to use
> # "smpd" or "slurm" as well.
> 
> and perhaps more tellingly:
> 
> # SLURM is an external process manager that uses MPICH's PMI interface
> # as well. 
> 
> Given that and the fact that smpd and hydra is/was/are distributed in
> MPICH under a BSD style permissive license I don't think there's
> anything to worry about.

No offense, and IANAL, but I fear you are incorrect. Note that I’m talking 
strictly about launch via srun using the SLURM PMI support, not mpiexec. 
However, we can take this offline and stop spamming the list with this 
discussion if you like.

Believe me - this has been a long-running point of discussion spanning the last 
10 years and involving multiple legal depts. Suffice to say that it has been 
analyzed to death. :-(

> 
> All the best,
> Chris
> -- 
> Christopher SamuelSenior Systems Administrator
> VLSCI - Victorian Life Sciences Computation Initiative
> Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
> http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Re: more detailed installation guide

2016-01-06 Thread Ralph Castain

> On Jan 6, 2016, at 12:27 PM, Bruce Roberts <schedulerk...@gmail.com> wrote:
> 
> 
> 
> On 01/06/16 11:53, Ralph Castain wrote:
>> 
>>> On Jan 6, 2016, at 9:53 AM, Bruce Roberts <schedulerk...@gmail.com 
>>> <mailto:schedulerk...@gmail.com>> wrote:
>>> 
>>> PMIx sounds really nice.
>>> 
>>> Forgive my naive question, but for mpirun would sstat and step accounting 
>>> continue to work as it does when using srun?
>> 
>> It does to an extent. You generally execute an mpirun for each job step. The 
>> mpirun launches its own daemons for each invocation, and the app procs are 
>> children of these daemons. So Slurm sees the daemons and will aggregate the 
>> accounting for its children into the daemon’s usage. However, the daemon 
>> mostly just sleeps once the app is running, and so the accounting should be 
>> okay (though you won’t get it for each individual app process).
>> 
>> Perhaps others out there who have used this can chime in with their 
>> experience?
> So it sounds like mpirun must use srun under the covers to launch the 
> daemons.  If using the cgroups proctrack plugin I'm guessing accounting will 
> work for the entire step just not down to the individual ranks as you state.  
> sstat probably isn't that useful at that point.  That is a large difference.

True on all accounts

>> 
>>>   Does mpirun also support Slurm's task placement/layout/binding/signaling? 
>>>  Our users use most of the features quite heavily as I am guessing others 
>>> do as well.
>> 
>> What mpirun supports depends on the MPI implementation, so I can only 
>> address your question for OpenMPI. You’ll find that OMPI’s mpirun provides a 
>> superset of Slurms options (i.e., we implemented a broader level of 
>> support), but the names and syntax of those options is different as it 
>> reflects that broader support. For example, we have the ability to allow 
>> more fine-grained layout/binding patterns and combinations.
> That is interesting.  I am able to lay out tasks in any order I would like 
> with srun on cores or threads of cores.  What finer-grained layout/binding 
> patterns are you referring to?

I believe srun only supports some specific patterns (e.g., cyclic), and those 
patterns combine placement and rank assignment. We separate out all three 
phases of mapping so you can control each independently:

* location (our “map-by” option) determines how the procs are laid out. 
Includes the ability to specify #cpus/proc. Note that you can also specify 
whether you want to treat cores as cpus, or use individual HTs as independent 
cpus (if HT is enabled)

* rank assignment (“rank-by”) allows you to define different algorithms for 
assigning the ranks to those procs. Depending on how the app has laid out its 
communication, this can sometimes be helpful

* binding (“bind-to”) lets you bind the resulting location to whatever level 
you want

We also have a “rank_file” mapper that lets you specify exact proc location on 
a rank-by-rank basis, and a “sequential” mapper that takes the list of hosts 
and places sequential ranks on each one in that specific order (i.e., you can 
get totally non-cyclic locations)
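
Concretely, a typical combination might look like this (a sketch - ./app and
the proc count are placeholders): place ranks round-robin by socket, number
them by core, and bind each one to a single core:

    mpirun --map-by socket --rank-by core --bind-to core -np 16 ./app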

The resulting pattern map is rather large, and I fully confess that all the 
options aren’t regularly used. Most users just let the default algorithms run. 
However, researchers continue to explore the performance impact of these 
options, and every so often someone finds some measurable performance 
improvement by laying a particular application out in a new manner. So we 
maintain the flexibility.

HTH
Ralph


> 
> Thanks for your insights so far, they are helpful!
>> 
>> 
>>> 
>>> Thanks!
>>> 
>>> On 01/06/16 07:54, Ralph Castain wrote:
>>>> As with all such rumors, there is some truth and some inaccuracies to it. 
>>>> Note that the various MPIs have historically differed significantly in how 
>>>> they implement mpirun, though the differences in terms of behavior and 
>>>> performance have been closing. So it is hard to provide a clearcut answer 
>>>> that spans time, and I’ll just report where we are now and looking ahead a 
>>>> bit.
>>>> 
>>>> PMI-1 support doesn't scale as well as what was done in mpirun from some 
>>>> of the MPI libraries, and so your (A) is certainly true. Remember that 
>>>> Slurm provides PMI-1 out-of-the-box and that you have to do a second build 
>>>> step to add PMI-2 support. So for people that just do the std install and 
>>>> run, this will be the expected situation.
>>>> 
>>>> For those that install PMI-

[slurm-dev] Re: more detailed installation guide

2016-01-05 Thread Ralph Castain

Simple reason, Chris - the PMI support is GPL 2.0, and so anything built 
against it automatically becomes GPL. So OpenHPC cannot distribute Slurm with 
those libraries.

Instead, we are looking to use the new PMIx library to provide wireup support, 
which includes backward support for PMI 1 and 2. I’m supposed to complete that 
backport in my copious free time :-)

Until then, you can only launch via mpirun - which is just as fast, actually, 
but does indeed have different cmd line options.


> On Jan 5, 2016, at 9:22 PM, Christopher Samuel  wrote:
> 
> 
> On 06/01/16 01:46, David Carlet wrote:
> 
>> Depending on where you are in the design/development phase for your
>> project, you might also consider switching to using the OpenHPC build.
> 
> Caution: for reasons that are unclear OpenHPC disables Slurm PMI support:
> 
> https://github.com/openhpc/ohpc/releases/download/v1.0.GA/Install_guide-CentOS7.1-1.0.pdf
> 
> # At present, OpenHPC is unable to include the PMI process
> # management server normally included within Slurm which
> # implies that srun cannot be use for MPI job launch. Instead,
> # native job launch mechanisms provided by the MPI stacks are
> # utilized and prun abstracts this process for the various
> # stacks to retain a single launch command.
> 
> Their spec file does:
> 
> # 6/16/15 karl.w.sch...@intel.com - do not package Slurm's version of libpmi 
> with OpenHPC.
> %if 0%{?OHPC_BUILD}
>   rm -f $RPM_BUILD_ROOT/%{_libdir}/libpmi*
>   rm -f $RPM_BUILD_ROOT/%{_libdir}/mpi_pmi2*
> %endif
> 
> 
> 
> -- 
> Christopher Samuel    Senior Systems Administrator
> VLSCI - Victorian Life Sciences Computation Initiative
> Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
> http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Re: more detailed installation guide

2016-01-05 Thread Ralph Castain
I’m sure it is - it’s a simple tool ala mpirun, and I can’t imagine they ship 
without it. You might make sure you have your path correctly set. I found it on 
my Centos7 box here:

/usr/lib64/openmpi/bin/ompi_info



> On Jan 5, 2016, at 4:08 PM, Simpson Lachlan  
> wrote:
> 
>> I very much doubt Centos7 would build something Slurm-specific for their 
>> default
>> pkg, but you could do an “ompi_info” and see if the pmi components were built
> 
> 
> It would seem that the Centos openmpi installation doesn't come with 
> ompi_info installed.
> 
> So I guess you are right :)
>   
> Cheers
> L.
> 
> 
>> 
>> 
>>> On Jan 5, 2016, at 2:03 PM, Simpson Lachlan
>>  wrote:
>>> 
>>> 
 Just one comment regarding openmpi building:
 https://wiki.fysik.dtu.dk/niflheim/SLURM#mpi-setup - At least with
 regard to openmpi, it should be built --with-pmi
>>> 
>>> Actually, on this - does OpenMPI *need* to be built especially on Centos7 
>>> or is it
>> built --with-pmi in the repos?
>>> 
>>> Cheers
>>> L.
>>> This email (including any attachments or links) may contain
>>> confidential and/or legally privileged information and is intended
>>> only to be read or used by the addressee.  If you are not the intended
>>> addressee, any use, distribution, disclosure or copying of this email
>>> is strictly prohibited.
>>> Confidentiality and legal privilege attached to this email (including
>>> any attachments) are not waived or lost by reason of its mistaken
>>> delivery to you.
>>> If you have received this email in error, please delete it and notify
>>> us immediately by telephone or email.  Peter MacCallum Cancer Centre
>>> provides no guarantee that this transmission is free of virus or that
>>> it has not been intercepted or altered and will not be liable for any
>>> delay in its receipt.
>>> 
> This email (including any attachments or links) may contain 
> confidential and/or legally privileged information and is 
> intended only to be read or used by the addressee.  If you 
> are not the intended addressee, any use, distribution, 
> disclosure or copying of this email is strictly 
> prohibited.  
> Confidentiality and legal privilege attached to this email 
> (including any attachments) are not waived or lost by 
> reason of its mistaken delivery to you.
> If you have received this email in error, please delete it 
> and notify us immediately by telephone or email.  Peter 
> MacCallum Cancer Centre provides no guarantee that this 
> transmission is free of virus or that it has not been 
> intercepted or altered and will not be liable for any delay 
> in its receipt.



[slurm-dev] Re: more detailed installation guide

2016-01-05 Thread Ralph Castain

I very much doubt Centos7 would build something Slurm-specific for their 
default pkg, but you could do an “ompi_info” and see if the pmi components were 
built


> On Jan 5, 2016, at 2:03 PM, Simpson Lachlan  
> wrote:
> 
> 
>> Just one comment regarding openmpi building:
>> https://wiki.fysik.dtu.dk/niflheim/SLURM#mpi-setup - At least with regard to
>> openmpi, it should be built --with-pmi
> 
> Actually, on this - does OpenMPI *need* to be built especially on Centos7 or 
> is it built --with-pmi in the repos?
> 
> Cheers
> L.
> This email (including any attachments or links) may contain 
> confidential and/or legally privileged information and is 
> intended only to be read or used by the addressee.  If you 
> are not the intended addressee, any use, distribution, 
> disclosure or copying of this email is strictly 
> prohibited.  
> Confidentiality and legal privilege attached to this email 
> (including any attachments) are not waived or lost by 
> reason of its mistaken delivery to you.
> If you have received this email in error, please delete it 
> and notify us immediately by telephone or email.  Peter 
> MacCallum Cancer Centre provides no guarantee that this 
> transmission is free of virus or that it has not been 
> intercepted or altered and will not be liable for any delay 
> in its receipt.
> 


[slurm-dev] Re: srun and openmpi

2015-12-16 Thread Ralph Castain

Yes, mpirun adds that to the environment to ensure we don’t pick up the wrong 
ess component. Try adding “OMPI_MCA_ess_base_verbose=10” to your environment 
and just srun one copy of hello_world - let’s ensure it picked up the right ess 
component.
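For reference, the check I have in mind is roughly this (a sketch - “./hello_world”
is a placeholder for your test binary):

  export OMPI_MCA_ess_base_verbose=10
  srun -n 1 ./hello_world 2>&1 | grep -i ess
  # you would expect an ess component suited to direct launch to be selected
  # (e.g. "pmi", or "slurmd" on older OMPI releases) rather than the default one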

I can try to replicate here, but it will take me a little while to get to it

> On Dec 16, 2015, at 8:50 AM, Michael Di Domenico <mdidomeni...@gmail.com> 
> wrote:
> 
> 
> Yes, i have PMI support included into openmpi
> 
> --with-slurm --with-psm --with-pmi=/opt/slurm
> 
> checking through the config.log it does appear the PMI tests build 
> successfully.
> 
> though checking with ompi_info i'm not sure i can say with 100%
> certainty it's in there.
> 
> ompi_info --parseable | grep pmi does return
> 
> mca:db:pmi
> mca:ess:pmi
> mca:grpcomm:pmi
> mca:pubsub:pmi
> 
> interestingly enough when i run 'mpirun env' (no slurm), i see ^pmi in
> the OMPI environment variables, but i'm not sure if that's supposed to
> do that or not
> 
> 
> 
> On 12/16/15, Ralph Castain <r...@open-mpi.org> wrote:
>> 
>> Hey Michael
>> 
>> Check ompi_info and ensure that the PMI support built - you have to
>> explicitly ask for it and provide the path to pmi.h
>> 
>> 
>>> On Dec 16, 2015, at 6:48 AM, Michael Di Domenico <mdidomeni...@gmail.com>
>>> wrote:
>>> 
>>> 
>>> i just compiled and installed Slurm 14.11.4 and Openmpi 1.10.0.  but i
>>> seem to have an srun oddity i've not seen before and i'm not exactly
>>> sure how to debug it
>>> 
>>> srun -n 4 hello_world
>>> - does not run, hangs in MPI_INIT
>>> 
>>> srun -n 4 -N1 hello_world
>>> - does not run, hangs in MPI_INIT
>>> 
>>> srun -n 4 -N 4
>>> - runs one task per node
>>> 
>>> sbatch and salloc seem to work okay launching using mpirun inside, and
>>> mpirun works without issue outside of slurm
>>> 
>>> i disabled all the gres and cgroup controls and all that
>>> 
>>> has anyone seen this before?
>> 


[slurm-dev] Re: srun and openmpi

2015-12-16 Thread Ralph Castain

Hey Michael

Check ompi_info and ensure that the PMI support built - you have to explicitly 
ask for it and provide the path to pmi.h
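For example, a build along these lines (a sketch - the prefix paths are
placeholders; the key part is pointing --with-pmi at the tree containing
Slurm's pmi.h):

  ./configure --prefix=/opt/openmpi --with-slurm --with-pmi=/opt/slurm
  make -j8 && make install
  ompi_info --parseable | grep pmi    # should then list components such as mca:ess:pmi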


> On Dec 16, 2015, at 6:48 AM, Michael Di Domenico  
> wrote:
> 
> 
> i just compiled and installed Slurm 14.11.4 and Openmpi 1.10.0.  but i
> seem to have an srun oddity i've not seen before and i'm not exactly
> sure how to debug it
> 
> srun -n 4 hello_world
> - does not run, hangs in MPI_INIT
> 
> srun -n 4 -N1 hello_world
> - does not run, hangs in MPI_INIT
> 
> srun -n 4 -N 4
> - runs one task per node
> 
> sbatch and salloc seem to work okay launching using mpirun inside, and
> mpirun works without issue outside of slurm
> 
> i disabled all the gres and cgroup controls and all that
> 
> has anyone seen this before?


[slurm-dev] Re: Messages of errors

2015-11-30 Thread Ralph Castain
Ummm….didn’t we just address this, only with a slightly different subject line?


> On Nov 30, 2015, at 10:48 AM, Fany Pagés Díaz  wrote:
> 
> When I send a job with slurm I always have these messages when the job 
> finished, anyone know what this means?
>  
> [root@cluster bin]# salloc -n 3 -N 2 --exclusive --gres=gpu:2 mpirun mpiocl 
> salloc: Granted job allocation 133
>   We have 3 processors
>   Spawning from compute-0-0.local 
>   OpenCL MPI
>  
>   Probing nodes...
>  NodePsid   Cards (devID)
>  --- -  --
> Available platforms:
> platform 0: Intel(R) OpenCL
> platform 1: NVIDIA CUDA
> selected platform 1
> Nombre dispositivo: GeForce GTX 260 
> Nombre dispositivo: GeForce GTX 260 
> Available platforms:
> platform 0: Intel(R) OpenCL
> platform 1: NVIDIA CUDA
> selected platform 1
> Nombre dispositivo: GeForce GTX 260 
> Nombre dispositivo: GeForce GTX 260 
>  
> --
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --
> salloc: Relinquishing job allocation 133
> salloc: Job allocation 133 has been revoked.
>  
> Thank you,
> Ing. Fany Pages Diaz



[slurm-dev] Re: Messages of mpirun noticed that the job aborted, but has no info as to the process

2015-11-30 Thread Ralph Castain
It means that at least one process exited with a non-zero status, indicating 
that a problem occurred in the application

> On Nov 30, 2015, at 9:39 AM, Fany Pagés Díaz  wrote:
> 
> When I send a job with slurm I always have these messages when the job 
> finished, anyone know what this means?
>  
> [root@cluster bin]# salloc -n 3 -N 2 --exclusive --gres=gpu:2 mpirun mpiocl 
> salloc: Granted job allocation 133
>   We have 3 processors
>   Spawning from compute-0-0.local 
>   OpenCL MPI
>  
>   Probing nodes...
>  NodePsid   Cards (devID)
>  --- -  --
> Available platforms:
> platform 0: Intel(R) OpenCL
> platform 1: NVIDIA CUDA
> selected platform 1
> Nombre dispositivo: GeForce GTX 260 
> Nombre dispositivo: GeForce GTX 260 
> Available platforms:
> platform 0: Intel(R) OpenCL
> platform 1: NVIDIA CUDA
> selected platform 1
> Nombre dispositivo: GeForce GTX 260 
> Nombre dispositivo: GeForce GTX 260 
>  
> --
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --
> salloc: Relinquishing job allocation 133
> salloc: Job allocation 133 has been revoked.
>  
> Thank you,
> Ing. Fany Pages Diaz



[slurm-dev] PMIx Birds-of-a-Feather meeting at SC15

2015-11-21 Thread Ralph Castain
Hi folks

For those of you who have not yet become acquainted with the PMI-Exascale 
(PMIx) project, we encourage you to take a look at the project’s web site as 
you will hopefully find the launch performance and advanced features it 
provides of interest:

http://pmix.github.io/master/ 

Although I had announced the meeting here, I know many of you were unable to 
attend due to the unfortunate conflict in timing with the Slurm BoF. We had a 
very well-received meeting regarding PMIx at Supercomputing - you will find the 
slides here:

http://www.slideshare.net/rcastain/sc15-pmix-birdsofafeather 


As SchedMD has previously stated, Slurm’s support for PMIx is planned to be 
released next May.

I also want to encourage everyone to please take the survey on 
malleable/evolving applications to help us design the APIs to support such 
models:

https://docs.google.com/forms/d/1bixKsm1379NH3Unp77vqIiE47GNnJcwKBaMSdNUew5s/viewform?ts=562fdbb7_requested=true
 


Thanks
Ralph

[slurm-dev] PMIx BoF at SC'15

2015-10-31 Thread Ralph Castain
Hello all

As some of you know, there will be a PMIx BoF at SC’15 this year.

It's at 12:15pm on Thurs, 19 Nov, 2015, in room 15:

http://sc15.supercomputing.org/schedule/event_detail?evid=bof122 


Unfortunately, that conflicts with the SLURM BoF, so we know that many of you 
won’t be able to attend. However, we'd like to capture some of your questions 
and input anyway.  Please submit any questions you have about PMIx, its 
roadmap, etc., on this quick form:

http://www.open-mpi.org/sc15/pmix/ 

Please feel free to share this note with anyone you feel might be interested - 
we welcome all input, and hope to see you there!

Ralph



[slurm-dev] Re: slurm-dev summary, was Re: What follows PMI-2?

2015-09-25 Thread Ralph Castain
Hi Andy

Let me see if I can clarify this for you and the others on this mailing list. 
The Ohio State group has been focused on improving the wireup algorithm in 
PMI-2, which focuses on the allgather operation. Hence their “ring” 
implementation.

PMIx has been focused on two goals:

1. high-speed “instant on” application startup that basically allows parallel 
programming models (MPI, OSHMEM, etc) to start/wireup as fast as the RM can 
start them

2. extending PMI to provide an enhanced application-RM partnership to support 
emerging programming paradigms where the application works in concert with the 
RM to steer execution. This takes many forms, including malleable workflow 
execution, file prepositioning, flexible allocations, and fault notifications.

We are accomplishing #1 by completely eliminating the wireup data exchange 
operation, and therefore the allgather disappears. Hence my comment that this 
question is moot. With the adoption of PMIx into essentially all RMs in use out 
there, we provide the infrastructure by which this is accomplished in a 
standard enough way that programming model implementations can reasonably count 
on it being there. We still have to maintain “backward compatibility” for those 
instances where we don’t have access to it, but we can treat those as 
lower-performance exception code paths.

OpenMPI will soon release the first use of that integration to achieve “instant 
on” behaviors. I can’t speak for the other MPI implementations, but history 
indicates that the community picks things up from each other rather quickly. So 
I expect that a year or so from now, the question of having a really good 
allgather will largely be moot. I’m not saying it will never be used, because 
we do have times when a “barrier” operation can be useful, and so PMIx will be 
adopting appropriate algorithms for such occasions (and OSU’s ring will surely 
be one of them). I’m just saying it won’t be in the critical startup path.

You can see more about the direction of PMIx, and how it is expected to be used 
in a broader vision of RM in general, in the following two presentations:

http://www.slideshare.net/rcastain/exascale-process-management-interface

http://www.slideshare.net/rcastain/hpc-controls-future

PMIx will be hosting a BoF at SC’15 in Austin, TX on Thurs, Nov 19th, at 
12:15-1:15pm to discuss the roadmap for this effort. Anyone interested in 
participating in the discussion, and contributing to the project(!), is welcome

HTH
Ralph


> On Sep 25, 2015, at 6:22 AM, Andy Riebs <andy.ri...@hpe.com> wrote:
> 
> Synthesizing what I think I've learned over the past 24 hours,
> The PMIx implementation described at 
> <http://slurm.schedmd.com/SLUG15/PMIx.pdf> describes a complete, 
> upward-compatible [I hope!] highly-scalable (exascale) replacement for the 
> PMI-1 and PMI-2 job launch facilities
> The "PMIX" paper at <http://slurm.schedmd.com/SLUG15/chakrabs-slug15.pdf> 
> <http://slurm.schedmd.com/SLUG15/chakrabs-slug15.pdf> describes creative ways 
> to use and extend the existing PMI-2 interface to exascale levels which will 
> [I hope!] be fully compatible with the PMIx implementation
> Clarifications and corrections gratefully accepted!
> Andy
> 
> On 09/24/2015 06:49 PM, Andy Riebs wrote:
>> 
>> Ralph, Artem, and Sourav, thanks for the explanation! 
>> 
>> Andy 
>> 
>> On 09/24/2015 04:52 PM, Ralph Castain wrote: 
>>> Hi Andy. 
>>> 
>>> I honestly have no idea why those guys did that :-). We’ve known about 
>>> their efforts for awhile, but this is the first I’ve seen them labeled as 
>>> PMIX. Guess we do differ on the capitalization of the last letter. 
>>> 
>>> Anyway, the PMIx effort is already being integrated in several popular RMs, 
>>> including Slurm, so I imagine it’s a moot point except for possibly 
>>> confusing people searching publications. 
>>> 
>>> Ralph 
>>> 
>>>> On Sep 24, 2015, at 1:42 PM, Andy Riebs <andy.ri...@hpe.com> 
>>>> <mailto:andy.ri...@hpe.com> wrote: 
>>>> 
>>>> 
>>>>  From the presentations at the Slurm Users' Group meeting, 
>>>> <http://slurm.schedmd.com/SLUG15/chakrabs-slug15.pdf> and 
>>>> <http://slurm.schedmd.com/SLUG15/PMIx.pdf>, it appears that two 
>>>> implementations of an improved PMI interface, both tagged "PMIX", are 
>>>> underway. Are these cooperating, competing, or "only just realized that 
>>>> they could be cooperating" activities? 
>>>> 
>>>> Andy 
>>>> 
>>>> -- 
>>>> Andy Riebs 
>>>> New email address! andy.ri...@hpe.com <mailto:andy.ri...@hpe.com> 
>>>> Hewlett-Packard Company 
>>>> High Performance Computing Software Engineering 
>>>> +1 404 648 9024 
>>>> My opinions are not necessarily those of HP 
> 



[slurm-dev] Re: What follows PMI-2?

2015-09-24 Thread Ralph Castain

Hi Andy.

I honestly have no idea why those guys did that :-). We’ve known about their 
efforts for awhile, but this is the first I’ve seen them labeled as PMIX. Guess 
we do differ on the capitalization of the last letter.

Anyway, the PMIx effort is already being integrated in several popular RMs, 
including Slurm, so I imagine it’s a moot point except for possibly confusing 
people searching publications.

Ralph

> On Sep 24, 2015, at 1:42 PM, Andy Riebs  wrote:
> 
> 
> From the presentations at the Slurm Users' Group meeting, 
> <http://slurm.schedmd.com/SLUG15/chakrabs-slug15.pdf> and 
> <http://slurm.schedmd.com/SLUG15/PMIx.pdf>, it appears that two 
> implementations of an improved PMI interface, both tagged "PMIX", are 
> underway. Are these cooperating, competing, or "only just realized that they 
> could be cooperating" activities?
> 
> Andy
> 
> -- 
> Andy Riebs
> New email address! andy.ri...@hpe.com
> Hewlett-Packard Company
> High Performance Computing Software Engineering
> +1 404 648 9024
> My opinions are not necessarily those of HP


[slurm-dev] Re: Large job socket timed out errors.

2015-09-21 Thread Ralph Castain

This sounds like something in Slurm - I don’t know how srun would know to emit 
a message if the app was failing to open a socket between its own procs.

Try starting the OMPI job with “mpirun” instead of srun and see if it has the 
same issue. If not, then that’s pretty convincing that it’s slurm.
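Something like the following comparison, inside the same allocation, usually
tells you which side to chase (a sketch - the node counts and “./osu_alltoall”
are placeholders for your own job):

  salloc -N 4 --ntasks-per-node=12 bash
  srun --mpi=pmi2 ./osu_alltoall    # the path that times out per your report
  mpirun ./osu_alltoall             # if this one runs cleanly, suspect the Slurm side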


> On Sep 21, 2015, at 7:26 PM, Timothy Brown  
> wrote:
> 
> 
> Hi Chris,
> 
> 
>> On Sep 21, 2015, at 7:36 PM, Christopher Samuel  
>> wrote:
>> 
>> 
>> On 22/09/15 07:17, Timothy Brown wrote:
>> 
>>> This is using mpiexec.hydra with slurm as the bootstrap. 
>> 
>> Have you tried Intel MPI's native PMI start up mode?
>> 
>> You just need to set the environment variable I_MPI_PMI_LIBRARY to the
>> path to the Slurm libpmi.so file and then you should be able to use srun
>> to launch your job instead.
>> 
> 
> Yeap, to the same effect. Here's what it gives:
> 
> srun --mpi=pmi2 
> /lustre/janus_scratch/tibr1099/osu_impi/libexec/osu-micro-benchmarks//mpi/collective/osu_alltoall
> srun: error: Task launch for 973564.0 failed on node node0453: Socket timed 
> out on send/recv operation
> srun: error: Application launch failed: Socket timed out on send/recv 
> operation
> 
> 
> 
>> More here:
>> 
>> http://slurm.schedmd.com/mpi_guide.html#intel_srun
>> 
>>> If I switch to OpenMPI the error is:
>> 
>> Which version, and was it build with --with-slurm and (if you're
>> version is not too ancient) --with-pmi=/path/to/slurm/install ?
> 
> Yeap. 1.8.5 (for 1.10 we're going to try and move everything to EasyBuild). 
> Yes we included PMI and the Slurm option. Our configure statement was:
> 
> module purge
> module load slurm/slurm
> module load gcc/5.1.0
> ./configure  \
>  --prefix=/curc/tools/x86_64/rh6/software/openmpi/1.8.5/gcc/5.1.0 \
>  --with-threads=posix \
>  --enable-mpi-thread-multiple \
>  --with-slurm \
>  --with-pmi=/curc/slurm/slurm/current/ \
>  --enable-static \
>  --enable-wrapper-rpath \
>  --enable-sensors \
>  --enable-mpi-ext=all \
>  --with-verbs
> 
> It's got me scratching my head, as I started off thinking it was an MPI 
> issue, spent awhile getting Intel's hydra and OpenMPI's oob to go over IB 
> instead of gig-e. This increased the success rate, but we were still failing.
> 
> Tried out a pure PMI (version 1) code (init, rank, size, fini), which worked 
> a lot of the times. Which made me think it was MPI again! However that fails 
> enough to say it's not MPI. The PMI v2 code I wrote, gives the wrong results 
> for rank and world size, so I'm sweeping that under the rug until I 
> understand it!
> 
> Just wondering if anybody has seen anything like this. Am happy to share our 
> conf file if that helps.
> 
> The only other thing I could possibly point a finger at (but don't believe it 
> is), is that the slurm masters (slurmctld) are only on gig-E.
> 
> I'm half thinking of opening a TT, but was hoping to get more information 
> (and possibly not increase the logging of slurm, which is my only next idea).
> 
> Thanks for your thoughts Chris.
> 
> Timothy


[slurm-dev] Re: Compile of 15.08.0 fails on trusty missing mpio.h

2015-09-12 Thread Ralph Castain

I think what we are saying is that HDF5 appears to be violating the MPI 
standard - they are not supposed to be directly including the mpio.h header. 
Someone might want to discuss that with them :-)


> On Sep 12, 2015, at 8:13 PM, Christopher Samuel  wrote:
> 
> 
> On 12/09/15 18:56, dani wrote:
> 
>> You shouldn't need openmpi preinstalled with recent slurm.
>> Slurm build pmi2, and then when building openmpi you supply --with-pmi
>> and point it to the pmi dir created by slurm
> 
> But Michael appears to want HDF5 support in Slurm and it is that HDF5
> code that seems to need these MPI headers.
> 
> All the best,
> Chris (finally in DC for the Slurm User Group)
> -- 
> Christopher Samuel    Senior Systems Administrator
> VLSCI - Victorian Life Sciences Computation Initiative
> Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
> http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Re: Compile of 15.08.0 fails on trusty missing mpio.h

2015-09-11 Thread Ralph Castain
Hmmm…the OMPI folks point out that directly including mpio.h doesn’t conform to 
the standard, which is why we don’t separately install it.


> On Sep 11, 2015, at 9:58 AM, Ralph Castain <r...@open-mpi.org> wrote:
> 
> I’ve asked the OMPI team if we can/should have a separate mpio.h in our 
> install - will let you know as soon as I hear back.
> 
> Ralph
> 
>> On Sep 11, 2015, at 8:45 AM, Michael Gutteridge 
>> <michael.gutteri...@gmail.com <mailto:michael.gutteri...@gmail.com>> wrote:
>> 
>> Thanks for all the advice- I followed Jared's suggestion- a basic .c that 
>> includes the hdf5 headers.  That and the HDF examples all compiled and ran 
>> fine.
>> 
>> Until, that is, I had 'src/common' in the include path.  Building my 
>> examples in the Slurm build tree using an include path with src/common and 
>> the slurm source top level fails with the same error.
>> 
>> What I see as possibly problematic is 'src/common/mpi.h'.  The HDF include 
>> file ('/usr/include/H5public.h') has a check that handles the MPICH vs. 
>> OpenMPI differences:
>> 
>> 
>>  60 #ifdef H5_HAVE_PARALLEL
>>  61 #   define MPICH_SKIP_MPICXX 1
>>  62 #   define OMPI_SKIP_MPICXX 1
>>  63 #   include <mpi.h>
>>  64 #ifndef MPI_FILE_NULL   /*MPIO may be defined in mpi.h already   
>> */
>>  65 #   include <mpio.h>
>>  66 #endif
>>  67 #endif
>> 
>> 
>> It appears that OpenMPI has MPIO defined in mpi.h, whereas MPICH defines 
>> that stuff in mpio.h.  As I've got a parallel build, I suspect that the make 
>> includes src/common/mpi.h, doesn't find "MPI_FILE_NULL" defined and attempts 
>> to include mpio.h.
>> 
>> I'm not the strongest C user, so I could be wrong on that, though when I 
>> remove "src/common" from the include path for libsh5util_old it does compile 
>> successfully.
>> 
>> I think worst case I'll end up with bootstrapping as Paul has indicated 
>> (thanks for the procedure, BTW).
>> 
>> Thanks for all your time
>> 
>> M
>> 
>> 
>> 
>> On Fri, Sep 11, 2015 at 7:30 AM, Van Der Mark, Paul <pvanderm...@fsu.edu 
>> <mailto:pvanderm...@fsu.edu>> wrote:
>> I'm glad we were not the only one with that problem. I basically did a
>> bootstrap compilation
>> 1. compile openmpi without slurm
>> 2. compile hdf5 with that openmpi version
>> 3. compile slurm with that hpdf5 version
>> 4. recompile openmpi with slurm
>> 5. for safety recompile hdf5 & slurm
>> 
>> One of the reasons for that slightly confusing setup is because our
>> semi-automated building system compiles 3 different versions of openmpi
>> and compiles hdf5 for serial and all those openmpi versions at once.
>> 
>> Best,
>> Paul
>> 
>> On Thu, 2015-09-10 at 14:13 -0700, Jared David Baker wrote:
>> > Hello Michael,
>> >
>> >
>> >
>> > I had this problem the other day when I was building Slurm on Arch. I
>> > had hdf-mpi package installed with OpenMPI and there was an
>> > inconsistency there as well. Basically the HDF5 implementation was
>> > built with mpio in mind, but the compiler invocation does not find an
>> > mpio.h as part of the OpenMPI installation. I guess my base point is I
>> > have a different HDF5 implementation on our HPC clusters to satisfy
>> > the Slurm requirements separate from the MPI enabled versions. Anyway,
>> > I’d be suspicious of the OpenMPI/HDF5 packages you’ve installed in
>> > Ubuntu and possibly open a bug if the HDF5 uses OpenMPI as the MPI
>> > implementation and doesn’t contain the mpio.h file. Can you compile a
>> > simple code with your MPI compiler wrapper that includes mpio.h?
>> >
>> >
>> >
>> > Best,
>> >
>> >
>> >
>> > Jared
>> >
>> >
>> >
>> > From: Michael Gutteridge [mailto:michael.gutteri...@gmail.com 
>> > <mailto:michael.gutteri...@gmail.com>]
>> > Sent: Thursday, September 10, 2015 12:41 PM
>> > To: slurm-dev
>> > Subject: [slurm-dev] Compile of 15.08.0 fails on trusty missing mpio.h
>> >
>> >
>> >
>> >
>> > Hi
>> >
>> >
>> >
>> >
>> >
>> > This really feels like an obvious one, but I'm having a devil of a
>> > time sorting out how to address this. I'm building Slurm 15.08.0 on
>> > Ubuntu 14.04 LTS with a minimal set of configure options:
>> >
>> >
>> >
>&

[slurm-dev] Re: Compile of 15.08.0 fails on trusty missing mpio.h

2015-09-11 Thread Ralph Castain
I’ve asked the OMPI team if we can/should have a separate mpio.h in our install 
- will let you know as soon as I hear back.

Ralph

> On Sep 11, 2015, at 8:45 AM, Michael Gutteridge 
>  wrote:
> 
> Thanks for all the advice- I followed Jared's suggestion- a basic .c that 
> includes the hdf5 headers.  That and the HDF examples all compiled and ran 
> fine.
> 
> Until, that is, I had 'src/common' in the include path.  Building my examples 
> in the Slurm build tree using an include path with src/common and the slurm 
> source top level fails with the same error.
> 
> What I see as possibly problematic is 'src/common/mpi.h'.  The HDF include 
> file ('/usr/include/H5public.h') has a check that handles the MPICH vs. 
> OpenMPI differences:
> 
> 
>  60 #ifdef H5_HAVE_PARALLEL
>  61 #   define MPICH_SKIP_MPICXX 1
>  62 #   define OMPI_SKIP_MPICXX 1
>  63 #   include <mpi.h>
>  64 #ifndef MPI_FILE_NULL   /*MPIO may be defined in mpi.h already   
> */
>  65 #   include <mpio.h>
>  66 #endif
>  67 #endif
> 
> 
> It appears that OpenMPI has MPIO defined in mpi.h, whereas MPICH defines that 
> stuff in mpio.h.  As I've got a parallel build, I suspect that the make 
> includes src/common/mpi.h, doesn't find "MPI_FILE_NULL" defined and attempts 
> to include mpio.h.
> 
> I'm not the strongest C user, so I could be wrong on that, though when I 
> remove "src/common" from the include path for libsh5util_old it does compile 
> successfully.
> 
> I think worst case I'll end up with bootstrapping as Paul has indicated 
> (thanks for the procedure, BTW).
> 
> Thanks for all your time
> 
> M
> 
> 
> 
> On Fri, Sep 11, 2015 at 7:30 AM, Van Der Mark, Paul  > wrote:
> I'm glad we were not the only one with that problem. I basically did a
> bootstrap compilation
> 1. compile openmpi without slurm
> 2. compile hdf5 with that openmpi version
> 3. compile slurm with that hpdf5 version
> 4. recompile openmpi with slurm
> 5. for safety recompile hdf5 & slurm
> 
> One of the reasons for that slightly confusing setup is because our
> semi-automated building system compiles 3 different versions of openmpi
> and compiles hdf5 for serial and all those openmpi versions at once.
> 
> Best,
> Paul
> 
> On Thu, 2015-09-10 at 14:13 -0700, Jared David Baker wrote:
> > Hello Michael,
> >
> >
> >
> > I had this problem the other day when I was building Slurm on Arch. I
> > had hdf-mpi package installed with OpenMPI and there was an
> > inconsistency there as well. Basically the HDF5 implementation was
> > built with mpio in mind, but the compiler invocation does not find an
> > mpio.h as part of the OpenMPI installation. I guess my base point is I
> > have a different HDF5 implementation on our HPC clusters to satisfy
> > the Slurm requirements separate from the MPI enabled versions. Anyway,
> > I’d be suspicious of the OpenMPI/HDF5 packages you’ve installed in
> > Ubuntu and possibly open a bug if the HDF5 uses OpenMPI as the MPI
> > implementation and doesn’t contain the mpio.h file. Can you compile a
> > simple code with your MPI compiler wrapper that includes mpio.h?
> >
> >
> >
> > Best,
> >
> >
> >
> > Jared
> >
> >
> >
> > From: Michael Gutteridge [mailto:michael.gutteri...@gmail.com 
> > ]
> > Sent: Thursday, September 10, 2015 12:41 PM
> > To: slurm-dev
> > Subject: [slurm-dev] Compile of 15.08.0 fails on trusty missing mpio.h
> >
> >
> >
> >
> > Hi
> >
> >
> >
> >
> >
> > This really feels like an obvious one, but I'm having a devil of a
> > time sorting out how to address this. I'm building Slurm 15.08.0 on
> > Ubuntu 14.04 LTS with a minimal set of configure options:
> >
> >
> >
> >
> >
> > --sysconfdir=/etc/slurm-llnl --localstatedir=/var/run/slurm-llnl
> > --with-munge --without-blcr --enable-pam --without-rpath
> > --disable-debug
> >
> >
> >
> >
> >
> > Configure succeeds OK, but make fails a little while in during the
> > build of the HDF5 components:
> >
> >
> >
> >
> >
> > Making all in libsh5util_old
> >
> >
> > make[8]: Entering directory
> > `/home/build/trusty/slurm-llnl/15.08.0/build/slurm-15.08.0/obj-x86_64-linux-gnu/src/plugins/acct_gather_profile/hdf5/sh5util/libsh5util_old'
> >
> >
> > /bin/bash ../../../../../../libtool  --tag=CC   --mode=compile gcc
> > -DHAVE_CONFIG_H -I.
> > -I../../../../../../../src/plugins/acct_gather_profile/hdf5/sh5util/libsh5util_old
> >  -I../../../../../.. -I../../../../../../slurm  -I../../../../../../.. 
> > -I../../../../../../../src/common -I. -I/usr/lib/openmpi/include -I 
> > /usr/include  -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat 
> > -Werror=format-security -pthread -fno-gcse -c -o sh5util.lo 
> > ../../../../../../../src/plugins/acct_gather_profile/hdf5/sh5util/libsh5util_old/sh5util.c
> >
> >
> > libtool: compile:  gcc -DHAVE_CONFIG_H -I.
> > -I../../../../../../../src/plugins/acct_gather_profile/hdf5/sh5util/libsh5util_old
> >  -I../../../../../.. 

[slurm-dev] Re: PMI2 in Slurm 14.11.8 ?

2015-09-01 Thread Ralph Castain

Might be my bad, Chris - it was my understanding that PMI2 support was to be 
installed by default in Slurm releases post 14.03.

If not, I’ll change the FAQ.
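For anyone hitting this with a source build rather than the rpms, the manual
step Chris describes below looks roughly like this (a sketch, assuming a Slurm
source tree that has already been configured at the top level):

  cd contribs/pmi2
  make
  make install    # installs libpmi2 and pmi2.h under the Slurm install prefix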

> On Sep 1, 2015, at 7:56 PM, Christopher Samuel  wrote:
> 
> 
> Hi all,
> 
> We're bringing up a new cluster with Slurm 14.11.8 and in building
> OpenMPI 1.10 for it I've noticed that the FAQ for OpenMPI with PMI2
> says:
> 
> # Yes, if you have configured OMPI --with-pmi=foo, where foo is
> # the path to the directory where pmi.h/pmi2.h is located.
> # Slurm (> 2.6, > 14.03) installs PMI-2 support by default.
> 
> and the Open-MPI README file says:
> 
> --with-pmi
> Build PMI support (by default, it is not built). If the pmi2.h
> header is found in addition to pmi.h, then support for PMI2 will be
> built.
> 
> However, that doesn't seem to be true for Slurm 14.11.8, there appears
> to be no pmi2.h installed by "make install" and indeed there is a
> contrib/pmi2 directory instead that has the pmi2.h header file and API
> which appears to need manual intervention to install.
> 
> Has there been some misunderstanding about PMI2 in Slurm?
> 
> All the best,
> Chris
> -- 
> Christopher Samuel    Senior Systems Administrator
> VLSCI - Victorian Life Sciences Computation Initiative
> Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
> http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Re: PMI2 in Slurm 14.11.8 ?

2015-09-01 Thread Ralph Castain

Danny, Moe, et al.: can you confirm that pmi2 is intentionally -not- being 
installed by default?


> On Sep 1, 2015, at 8:18 PM, Christopher Samuel <sam...@unimelb.edu.au> wrote:
> 
> 
> Hi Ralph,
> 
> On 02/09/15 13:10, Ralph Castain wrote:
> 
>> Might be my bad, Chris - it was my understanding that PMI2 support
>> was to be installed by default in Slurm releases post 14.03.
> 
> Well to be fair to you srun does list it as an option even though it
> doesn't appear to be installed..
> 
> [samuel@snowy-m ~]$ srun --mpi=list
> srun: MPI types are...
> [...]
> srun: mpi/pmi2
> srun: mpi/openmpi
> [...]
> 
> All the best,
> Chris
> -- 
> Christopher Samuel    Senior Systems Administrator
> VLSCI - Victorian Life Sciences Computation Initiative
> Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
> http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Re: PMI2 in Slurm 14.11.8 ?

2015-09-01 Thread Ralph Castain
Thanks! I will clarify in our documentation

Sorry for the confusion, Chris

> On Sep 1, 2015, at 8:32 PM, Danny Auble <d...@schedmd.com> wrote:
> 
> I'm fairly sure if you install via rpm it will be there. Contribs isn't built 
> through the normal make as was pointed out, but it is through the rpm 
> process. 
> 
> On September 1, 2015 8:24:21 PM PDT, Ralph Castain <r...@open-mpi.org> wrote:
> 
> Danny, Moe, et al.: can you confirm that pmi2 is intentionally -not- being 
> installed by default?
> 
> 
>  On Sep 1, 2015, at 8:18 PM, Christopher Samuel <sam...@unimelb.edu.au> wrote:
>  
>  
>  Hi Ralph,
>  
>  On 02/09/15 13:10, Ralph Castain wrote:
>  
>  Might be my bad, Chris - it was my understanding that PMI2 support
>  was to be installed by default in Slurm releases post 14.03.
>  
>  Well to be fair to you srun does list it as an option even though it
>  doesn't appear to be installed..
>  
>  [samuel@snowy-m ~]$ srun --mpi=list
>  srun: MPI types are...
>  [...]
>  srun: mpi/pmi2
>  srun: mpi/openmpi
>  [...]
>  
>  All the
> best,
>  Chris
>  -- 
>  Christopher Samuel    Senior Systems Administrator
>  VLSCI - Victorian Life Sciences Computation Initiative
>  Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
>  http://www.vlsci.org.au/  http://twitter.com/vlsci



[slurm-dev] Re: MPI-OpenMP jobs on SLURM fail ORTE_ERROR_LOG: Not found in file ess_slurmd_module.c

2015-08-27 Thread Ralph Castain
Hmmm….I haven’t seen someone using OMPI 1.6.0 in a very long time. Please note 
that the latest OMPI release is now 1.10.0, so your installation is rather far 
behind.

At the very least, I would start by updating OMPI to the 1.8.8 or 1.10.0 level. 
You will then find that the SLURM integration has improved quite a bit, and you 
no longer need to use the --resv-ports option. OMPI will run with the standard 
PMI library.

You will also find that mpirun will respect the SLURM-assigned task affinity.

You may also want to update SLURM, but I leave that to others to advise - the 
OMPI change by itself should resolve the problem.
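Once OMPI (and ideally Slurm) are current, a hybrid job can then be expressed
directly in the batch script, e.g. (a sketch - the binary name and sizes are
placeholders):

  #!/bin/bash
  #SBATCH -N 2
  #SBATCH --ntasks-per-node=2
  #SBATCH --cpus-per-task=8
  export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
  srun --mpi=pmi2 ./hybrid_app    # no --resv-ports needed on the PMI path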


 On Aug 27, 2015, at 2:26 AM, Turner, Andrew andrew.tur...@ccfe.ac.uk wrote:
 
 Dear all
  
 We are running SLURM version 2.3 and openmpi 1.6.0.
 In order to run openmpi jobs and inherit the correct task affinity from 
 SLURM, jobs are executed with 'srun --resv-ports ./the_job' under sbatch (or 
 salloc).
  
 Pure mpi tasks with --cpus-per-task=1 run fine.
 The issue is when attempting a hybrid mpi-omp task, with --cpus-per-task > 1, 
  the job fails when using 'srun --resv-ports'.
 Many error messages are printed, along the lines of
 ' ORTE_ERROR_LOG: Not found in file ess_slurmd_module.c at line 504'
  
 I am not the administrator of the cluster, only a user, but I was hoping we 
 might be able to point the administrators in a useful direction to solve the 
 issue.
 Is this a known issue?  E.g. due to some incompatibility between this SLURM 
 version and the OpenMPI we have installed?  Would updating SLURM and/or 
 OpenMPI solve this issue?  Or could it be a configuration issue that is 
 easily fixed?  (see config file below)
  
 As a side issue, maybe related, we find that
 - We can run multiple threads per task if we execute using mpirun (e.g. 
 mpirun -bind-to-socket -bysocket), but mpirun does not know anything about 
 what cores it has been allocated, so it only works with exclusive node 
 option.  On shared nodes it will often crash.
 - We don’t use mpirun for pure MPI jobs since we find tasks do not have the 
 correct task affinity/binding (in this case, no binding).  Hence we use 
 ‘srun’ since nodes are shared.
 - With srun we must use ‘--resv-ports’.  Without resv-ports results in the 
 error message:
   orte_grpcomm_modex failed
   -- Returned A message is attempting to be sent to a process whose contact 
 information is unknown (-117) instead of Success (0)
  
 Hopefully someone can advise how we can make it work for multiple threaded 
 jobs?  Thanks in advance.
  
 Andy
  
  
 Andrew Turner
 Culham Centre for Fusion Energy
 Culham Science Centre
 Abingdon
 Oxfordshire
 OX14 3DB
  
 www.ccfe.ac.uk
  
 Our slurm.conf file
  
 ClusterName=erik
 ControlMachine=erik000
 BackupController=erik001
 SlurmUser=slurm
 SlurmctldPort=6817
 SlurmdPort=6818
 AuthType=auth/munge
 StateSaveLocation=/home/sysadmin/SlurmState
 SlurmdSpoolDir=/tmp/slurmd
 SwitchType=switch/none
 MpiDefault=none
 SlurmctldPidFile=/var/run/slurmctld.pid
 SlurmdPidFile=/var/run/slurmd.pid
 Proctracktype=proctrack/linuxproc
 CacheGroups=0
 ReturnToService=1
 TaskPlugin=task/affinity
 # TIMERS
 SlurmctldTimeout=300
 SlurmdTimeout=300
 InactiveLimit=0
 MinJobAge=300
 KillWait=30
 Waittime=0
 #
 # SCHEDULING
 SchedulerType=sched/wiki
 SchedulerPort=7321
 SelectType=select/cons_res
 FastSchedule=1
 # LOGGING
 SlurmctldDebug=3
 SlurmctldLogFile=/var/log/slurmctld.log
 SlurmdDebug=3
 SlurmdLogFile=/var/log/slurmd.log
 JobCompType=jobcomp/filetxt
 JobCompLoc=/var/slurm/accounting
 #
 # ACCOUNTING
 JobAcctGatherType=jobacct_gather/linux
 JobAcctGatherFrequency=30
 #
 AccountingStorageType=accounting_storage/filetxt
 #
 # MPI
 MpiParams=ports=12000-12999
 #
 # COMPUTE NODES
 NodeName=erik000 Procs=16 Sockets=2 CoresPerSocket=8 ThreadsPerCore=1 
 State=UNKNOWN
 NodeName=DEFAULT Procs=16 Sockets=2 CoresPerSocket=8 ThreadsPerCore=1 
 RealMemory=129009 State=UNKNOWN
 NodeName=erik[001-044]
 PartitionName=erik Nodes=erik[001-044] Default=YES MaxTime=INFINITE State=UP



[slurm-dev] Re: ntasks-per-node makes openmpi fail

2015-08-25 Thread Ralph Castain

I’m assuming you are executing the jobs as “srun —ntasks-per-node N ../my_app” 
as opposed to launching them via mpirun? If so, then be aware that the OMPI 1.6 
series isn’t all that well integrated with slurm for that use-case. You might 
want to upgrade to OMPI 1.8.8 instead.


 On Aug 25, 2015, at 2:17 AM, Sefa Arslan sefa.ars...@tubitak.gov.tr wrote:
 
 
 Hi,
 
 In order to distribute mpi processes uniformly (in terms of number of cores) 
 over the nodes, our users use the --ntasks-per-node parameter in their jobs. 
 Processes are distributed uniformly, but openmpi jobs fail. When users use 
 -n instead of --ntasks-per-node, jobs run properly but the distribution 
 of cores is not uniform in this case.
 
 We use openmpi-1.6.5 and slurm 13.12.0. Is there a known bug with these 
 versions, or could it be a misconfiguration or compilation problem?
 
 Thanks a lot..
 
 Sefa Arslan


[slurm-dev] Re: srun + openmpi : Missing locality information

2015-06-22 Thread Ralph Castain
Sounds odd - I suspect there is some issue with the IB network, then, as we
regularly test against IB and have seen no problems. I'd suggest switching
this thread to the OMPI user mailing list, and provide the usual requested
info for these problems: your configure cmd line, config.log, and output
from ompi_info.

We'll get it figured out :-)
Ralph


On Mon, Jun 22, 2015 at 8:58 AM, Wiegand, Paul wieg...@ist.ucf.edu wrote:

 I upgraded to OpenMPI 1.8.6 last week, and this did change how the problem
 presents but did not solve our problems.  Now MPI indicates that it cannot
 use the BTL OpenIB and so runs, but without using the IB.  I also tried
 building with the --without-scif switch as suggested earlier last week,
 with no help.  I've not had time to dig in since then.

 Still no luck on our end.

 Paul.

  On Jun 22, 2015, at 11:45, Ralph Castain r...@open-mpi.org wrote:
 
  Good to hear! Thanks
  Ralph
 
 
  On Mon, Jun 22, 2015 at 7:50 AM, Paul van der Mark pvanderm...@fsu.edu
 wrote:
 
  Hello Ralph,
 
  A quick update: our upgrade to OpenMPI 1.8.6 (from 1.8.5) seems to have
  done the trick. Since we only recently starting to prepare for a switch
  to slurm, I can't confirm if this already existed in 1.8.4. Our slurm
  version is 14.11.7.
 
  Best,
  Paul
 
 
  On 06/18/2015 10:45 AM, Ralph Castain wrote:
  
   Please let us know - FWIW, we aren’t seeing any such reports on the
 OMPI mailing lists, and we run our test harness against Slurm (and other
 RMs) every night.
  
   Also, please tell us what version of Slurm you are using. We do
 sometimes see regressions against newer versions as they appear, and that
 may be the case here.
  
  
   On Jun 18, 2015, at 7:32 AM, Paul van der Mark pvanderm...@fsu.edu
 wrote:
  
  
   Hello John,
  
   We tried a number of combination of flags and some work and some
 don't.
   1. salloc -N 3 --ntasks-per-node 3 srun ./mympiprog
   2. salloc -n 9 srun ./mympiprog
   (test cluster with 8 cores per node)
  
   Case 1: works flawless (for every combination)
   Case 2: works sometimes, warnings in some cases, segmentation faults
 in
   some cases (for example -n 10) in opal_memory_ptmalloc2_int_malloc.
  
   mpirun instead of srun works all the time.
  
   We are going to look into openmpi 1.8.6 now. We would like to have -n
 X
   work, since that is what most of our users use anyway.
  
   Best,
   Paul
  
  
  
  
   On 06/05/2015 08:19 AM, John Desantis wrote:
  
   Paul,
  
   How are you invoking srun with the application in question?
  
   It seems strange that the messages would be manifest when the job
 runs
   on more than one node.  Have you tried passing the flags -N and
   --ntasks-per-node for testing?  What about using -w hostfile?
   Those would be the options that I'd immediately try to begin
   trouble-shooting the issue.
  
   John DeSantis
  
   2015-06-02 14:19 GMT-04:00 Paul van der Mark pvanderm...@fsu.edu:
  
   All,
  
   We are preparing for a switch from our current job scheduler to
 slurm
   and I am running into a strange issue. I compiled openmpi with slurm
   support and when I start a job with sbatch and use mpirun everything
   works fine. However, when I use srun instead of mpirun and the job
 does
   not fit on a single node, I either receive the following openmpi
 warning
   a number of times:
  
 --
   WARNING: Missing locality information required for sm
 initialization.
   Continuing without shared memory support.
  
 --
   or a segmentation fault in an openmpi library (address not mapped)
 or
   both.
  
   I only observe this with mpi-programs compiled with openmpi and ran
 by
   srun when the job does not fit on a single node. The same program
   started by openmpi's mpirun runs fine. The same source compiled with
   mvapich2 works fine with srun.
  
   Some version info:
   slurm 14.11.7
   openmpi 1.8.5
   hwloc 1.10.1 (used for both slurm and openmpi)
   os: RHEL 7.1
  
   Has anyone seen that warning before and what would be a good place
 to
   start troubleshooting?
  
  
   Thank you,
   Paul
 




[slurm-dev] Re: srun + openmpi : Missing locality information

2015-06-22 Thread Ralph Castain
Good to hear! Thanks
Ralph


On Mon, Jun 22, 2015 at 7:50 AM, Paul van der Mark pvanderm...@fsu.edu
wrote:


 Hello Ralph,

 A quick update: our upgrade to OpenMPI 1.8.6 (from 1.8.5) seems to have
 done the trick. Since we only recently starting to prepare for a switch
 to slurm, I can't confirm if this already existed in 1.8.4. Our slurm
 version is 14.11.7.

 Best,
 Paul


 On 06/18/2015 10:45 AM, Ralph Castain wrote:
 
  Please let us know - FWIW, we aren’t seeing any such reports on the OMPI
 mailing lists, and we run our test harness against Slurm (and other RMs)
 every night.
 
  Also, please tell us what version of Slurm you are using. We do
 sometimes see regressions against newer versions as they appear, and that
 may be the case here.
 
 
  On Jun 18, 2015, at 7:32 AM, Paul van der Mark pvanderm...@fsu.edu
 wrote:
 
 
  Hello John,
 
  We tried a number of combination of flags and some work and some don't.
  1. salloc -N 3 --ntasks-per-node 3 srun ./mympiprog
  2. salloc -n 9 srun ./mympiprog
  (test cluster with 8 cores per node)
 
  Case 1: works flawless (for every combination)
  Case 2: works sometimes, warnings in some cases, segmentation faults in
  some cases (for example -n 10) in opal_memory_ptmalloc2_int_malloc.
 
  mpirun instead of srun works all the time.
 
  We are going to look into openmpi 1.8.6 now. We would like to have -n X
  work, since that is what most of our users use anyway.
 
  Best,
  Paul
 
 
 
 
  On 06/05/2015 08:19 AM, John Desantis wrote:
 
  Paul,
 
  How are you invoking srun with the application in question?
 
  It seems strange that the messages would be manifest when the job runs
  on more than one node.  Have you tried passing the flags -N and
  --ntasks-per-node for testing?  What about using -w hostfile?
  Those would be the options that I'd immediately try to begin
  trouble-shooting the issue.
 
  John DeSantis
 
  2015-06-02 14:19 GMT-04:00 Paul van der Mark pvanderm...@fsu.edu:
 
  All,
 
  We are preparing for a switch from our current job scheduler to slurm
  and I am running into a strange issue. I compiled openmpi with slurm
  support and when I start a job with sbatch and use mpirun everything
  works fine. However, when I use srun instead of mpirun and the job
 does
  not fit on a single node, I either receive the following openmpi
 warning
  a number of times:
 
 --
  WARNING: Missing locality information required for sm initialization.
  Continuing without shared memory support.
 
 --
  or a segmentation fault in an openmpi library (address not mapped) or
  both.
 
  I only observe this with mpi-programs compiled with openmpi and ran by
  srun when the job does not fit on a single node. The same program
  started by openmpi's mpirun runs fine. The same source compiled with
  mvapich2 works fine with srun.
 
  Some version info:
  slurm 14.11.7
  openmpi 1.8.5
  hwloc 1.10.1 (used for both slurm and openmpi)
  os: RHEL 7.1
 
  Has anyone seen that warning before and what would be a good place to
  start troubleshooting?
 
 
  Thank you,
  Paul



[slurm-dev] Re: srun + openmpi : Missing locality information

2015-06-22 Thread Ralph Castain
Yeah, your follow-on note indicates this really is a Slurm issue and not an
OMPI one, so I'd keep the conversation here. I'm afraid I don't know why
Slurm isn't sourcing those files - have to defer to others.



On Mon, Jun 22, 2015 at 9:17 AM, Wiegand, Paul wieg...@ist.ucf.edu wrote:

 I will do that if that is the general feeling here; however, we run
 perfectly fine on this network over Torque/Moab and without a resource
 manager at all.  The only time we see these problems is when we run under
 slurm.

 Paul.

  On Jun 22, 2015, at 12:08, Ralph Castain r...@open-mpi.org wrote:
 
  Sounds odd - I suspect there is some issue with the IB network, then, as
 we regularly test against IB and have seen no problems. I'd suggest
 switching this thread to the OMPI user mailing list, and provide the usual
 requested info for these problems: your configure cmd line, config.log, and
 output from ompi_info.
 
  We'll get it figured out :-)
  Ralph
 
 
  On Mon, Jun 22, 2015 at 8:58 AM, Wiegand, Paul wieg...@ist.ucf.edu
 wrote:
  I upgraded to OpenMPI 1.8.6 last week, and this did change how the
 problem presents but did not solve our problems.  Now MPI indicates that it
 cannot use the BTL OpenIB and so runs, but without using the IB.  I also
 tried building with the --without-scif switch as suggested earlier last
 week, with no help.  I've not had time to dig in since then.
 
  Still no luck on our end.
 
  Paul.
 
   On Jun 22, 2015, at 11:45, Ralph Castain r...@open-mpi.org wrote:
  
   Good to hear! Thanks
   Ralph
  
  
   On Mon, Jun 22, 2015 at 7:50 AM, Paul van der Mark 
 pvanderm...@fsu.edu wrote:
  
   Hello Ralph,
  
   A quick update: our upgrade to OpenMPI 1.8.6 (from 1.8.5) seems to have
   done the trick. Since we only recently starting to prepare for a switch
   to slurm, I can't confirm if this already existed in 1.8.4. Our slurm
   version is 14.11.7.
  
   Best,
   Paul
  
  
   On 06/18/2015 10:45 AM, Ralph Castain wrote:
   
Please let us know - FWIW, we aren’t seeing any such reports on the
 OMPI mailing lists, and we run our test harness against Slurm (and other
 RMs) every night.
   
Also, please tell us what version of Slurm you are using. We do
 sometimes see regressions against newer versions as they appear, and that
 may be the case here.
   
   
On Jun 18, 2015, at 7:32 AM, Paul van der Mark pvanderm...@fsu.edu
 wrote:
   
   
Hello John,
   
We tried a number of combination of flags and some work and some
 don't.
1. salloc -N 3 --ntasks-per-node 3 srun ./mympiprog
2. salloc -n 9 srun ./mympiprog
(test cluster with 8 cores per node)
   
Case 1: works flawless (for every combination)
Case 2: works sometimes, warnings in some cases, segmentation
 faults in
some cases (for example -n 10) in opal_memory_ptmalloc2_int_malloc.
   
mpirun instead of srun works all the time.
   
We are going to look into openmpi 1.8.6 now. We would like to have
 -n X
work, since that is what most of our users use anyway.
   
Best,
Paul
   
   
   
   
On 06/05/2015 08:19 AM, John Desantis wrote:
   
Paul,
   
How are you invoking srun with the application in question?
   
It seems strange that the messages would be manifest when the job
 runs
on more than one node.  Have you tried passing the flags -N and
--ntasks-per-node for testing?  What about using -w hostfile?
Those would be the options that I'd immediately try to begin
trouble-shooting the issue.
   
John DeSantis
   
2015-06-02 14:19 GMT-04:00 Paul van der Mark pvanderm...@fsu.edu
 :
   
All,
   
We are preparing for a switch from our current job scheduler to
 slurm
and I am running into a strange issue. I compiled openmpi with
 slurm
support and when I start a job with sbatch and use mpirun
 everything
works fine. However, when I use srun instead of mpirun and the
 job does
not fit on a single node, I either receive the following openmpi
 warning
a number of times:
   
 --
WARNING: Missing locality information required for sm
 initialization.
Continuing without shared memory support.
   
 --
or a segmentation fault in an openmpi library (address not
 mapped) or
both.
   
I only observe this with mpi-programs compiled with openmpi and
 ran by
srun when the job does not fit on a single node. The same program
started by openmpi's mpirun runs fine. The same source compiled
 with
mvapich2 works fine with srun.
   
Some version info:
slurm 14.11.7
openmpi 1.8.5
hwloc 1.10.1 (used for both slurm and openmpi)
os: RHEL 7.1
   
Has anyone seen that warning before and what would be a good
 place to
start troubleshooting?
   
   
Thank you,
Paul
  
 
 




[slurm-dev] Re: srun + openmpi : Missing locality information

2015-06-05 Thread Ralph Castain

I’ll take a look...

 On Jun 5, 2015, at 5:19 AM, John Desantis desan...@mail.usf.edu wrote:
 
 
 Paul,
 
 How are you invoking srun with the application in question?
 
 It seems strange that the messages would be manifest when the job runs
 on more than one node.  Have you tried passing the flags -N and
 --ntasks-per-node for testing?  What about using -w hostfile?
 Those would be the options that I'd immediately try to begin
 trouble-shooting the issue.
 
 John DeSantis
 
 2015-06-02 14:19 GMT-04:00 Paul van der Mark pvanderm...@fsu.edu:
 
 All,
 
 We are preparing for a switch from our current job scheduler to slurm
 and I am running into a strange issue. I compiled openmpi with slurm
 support and when I start a job with sbatch and use mpirun everything
 works fine. However, when I use srun instead of mpirun and the job does
 not fit on a single node, I either receive the following openmpi warning
 a number of times:
 --
 WARNING: Missing locality information required for sm initialization.
 Continuing without shared memory support.
 --
 or a segmentation fault in an openmpi library (address not mapped) or
 both.
 
 I only observe this with mpi-programs compiled with openmpi and ran by
 srun when the job does not fit on a single node. The same program
 started by openmpi's mpirun runs fine. The same source compiled with
 mvapich2 works fine with srun.
 
 Some version info:
 slurm 14.11.7
 openmpi 1.8.5
 hwloc 1.10.1 (used for both slurm and openmpi)
 os: RHEL 7.1
 
 Has anyone seen that warning before and what would be a good place to
 start troubleshooting?
 
 
 Thank you,
 Paul


[slurm-dev] Re: srun + openmpi : Missing locality information

2015-06-05 Thread Ralph Castain
Can you try the latest 1.8 nightly tarball? I’m not able to replicate the 
problem with it, and so I think it might have been fixed. IIRC, there was an 
error with the now-deprecated “sm” BTL.

http://www.open-mpi.org/nightly/v1.8/

I’ll be releasing 1.8.6 in the near future


 On Jun 5, 2015, at 6:21 AM, Ralph Castain r...@open-mpi.org wrote:
 
 
 I’ll take a look...
 
 On Jun 5, 2015, at 5:19 AM, John Desantis desan...@mail.usf.edu wrote:
 
 
 Paul,
 
 How are you invoking srun with the application in question?
 
 It seems strange that the messages would be manifest when the job runs
 on more than one node.  Have you tried passing the flags -N and
 --ntasks-per-node for testing?  What about using -w hostfile?
 Those would be the options that I'd immediately try to begin
 trouble-shooting the issue.
 
 John DeSantis
 
 2015-06-02 14:19 GMT-04:00 Paul van der Mark pvanderm...@fsu.edu:
 
 All,
 
 We are preparing for a switch from our current job scheduler to slurm
 and I am running into a strange issue. I compiled openmpi with slurm
 support and when I start a job with sbatch and use mpirun everything
 works fine. However, when I use srun instead of mpirun and the job does
 not fit on a single node, I either receive the following openmpi warning
 a number of times:
 --
 WARNING: Missing locality information required for sm initialization.
 Continuing without shared memory support.
 --
 or a segmentation fault in an openmpi library (address not mapped) or
 both.
 
 I only observe this with mpi-programs compiled with openmpi and ran by
 srun when the job does not fit on a single node. The same program
 started by openmpi's mpirun runs fine. The same source compiled with
 mvapich2 works fine with srun.
 
 Some version info:
 slurm 14.11.7
 openmpi 1.8.5
 hwloc 1.10.1 (used for both slurm and openmpi)
 os: RHEL 7.1
 
 Has anyone seen that warning before and what would be a good place to
 start troubleshooting?
 
 
 Thank you,
 Paul



[slurm-dev] Re: Slurm and docker/containers

2015-05-31 Thread Ralph Castain
I sympathize with the problem. In addition, although I am not a lawyer, it is 
my understanding that Docker’s license is incompatible with Slurm’s GPL, and 
thus you cannot distribute such an integration.

FWIW, I’m just starting on my 2nd “pre-retirement” project (PMIx being the 
first and ongoing one) to build an open source HPC container (under the 
3-clause BSD license) that will run at user level, provide bare-metal (QoS 
managed) access to the OS-bypass fabric, provide direct injection of user 
applications, and function ship access to the file system. I expect to setup a 
public Github for it in the next week or so, and hopefully have at least a 
start in time for SC15.

Anyone interested can drop me a line off-list (rhc at open-mpi.org) and I’ll 
notify you when I get things set up. I’m more 
than happy to have other interested parties collaborate on it!


 On May 31, 2015, at 5:27 PM, Christopher Samuel sam...@unimelb.edu.au wrote:
 
 
 On 21/05/15 00:38, Michael Jennings wrote:
 
 At the risk of further putting words in Chris' mouth (which I risk
 doing only because I know he'll forgive me if I get it wrong, and it
 will help him out if I get it right), I'll say what the two of us are
 asking for is if anyone has a working implementation of running jobs
 under SLURM which execute inside a Docker container (or similar
 container technology), and if so, how you wound up choosing to do it!
 :-)
 
 Sorry for being absent for a while after starting this thread, pressures
 of work.
 
 Michael hit the nail on the head for me there.
 
 The security side of things is an issue, though I'm not sure how much
 the fact that the program is running in a separate UID namespace helps,
 presumably if you've got to give it HPC filesystem access then the
 answer is probably not at all.
 
 One of my concerns has always been that as these images age without
 updates then their exposure to known security bugs increases.
 
 That seems to be borne out by this recent survey:
 
 http://www.banyanops.com/blog/analyzing-docker-hub/
 
 # Over 30% of Official Images in Docker Hub Contain High Priority
 # Security Vulnerabilities
 #
 # [...] Surprisingly, we found that more than 30% of images in
 # official repositories are highly susceptible to a variety of
 # security attacks (e.g., Shellshock, Heartbleed, Poodle, etc.).
 # For general images – images pushed by docker users, but not
 # explicitly verified by any authority – this number jumps up
 # to ~40% with a sampling error bound of 3%. [...]
 
 If anything that puts me off liking them even more. :-(
 
 All the best,
 Chris
 -- 
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci



[slurm-dev] Re: Need for recompiling openmpi built with --with-pmi?

2015-04-16 Thread Ralph Castain
No, you shouldn't have to do so - it's a dynamic library that gets picked
up at execution
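
For reference, a build along the lines described below would look roughly like
this (prefixes are placeholders, not a recommendation):

  ./configure --prefix=/opt/openmpi --with-slurm --with-pmi=/opt/slurm
  make install
  srun --mpi=pmi2 -n 32 ./my_mpi_app   # or plain "srun -n 32 ...", depending
                                       # on which PMI flavor was picked up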


On Thu, Apr 16, 2015 at 2:55 AM, Bjørn-Helge Mevik b.h.me...@usit.uio.no
wrote:


 We are considering compiling openmpi with --with-pmi=/opt/slurm to
 enable running mpi jobs with srun.

 If we do this, will we have to recompile openmpi and/or programs built
 with openmpi when we upgrade slurm? (If so, only for major upgrades, or
 for minor upgrades as well?)

 --
 Regards,
 Bjørn-Helge Mevik, dr. scient,
 Department for Research Computing, University of Oslo


[slurm-dev] Re: Need for recompiling openmpi built with --with-pmi?

2015-04-16 Thread Ralph Castain
Hmmm...yeah, it sounds like Slurm changed its library names and/or
dependencies. I'm afraid that you do indeed need to recompile OMPI in that
case. You probably need to rerun configure as well, just to be safe.

Sorry - outside OMPI's control :-/


On Thu, Apr 16, 2015 at 5:22 AM, Uwe Sauter uwe.sauter...@gmail.com wrote:


 Hi,

 I have the case that OpenMPI was built against Slurm 14.03 (which provided
 libslurm.so.27). Since upgrading to 14.11 I get errors
 like:

 [controller:35605] mca: base: component_find: unable to open
 /opt/apps/openmpi/1.8.1/gcc/4.9/0/lib/openmpi/mca_ess_pmi:
 libslurm.so.27: cannot open shared object file: No such file or directory
 (ignored)

 because now Slurm provides libslurm.so.28.

 I believe the only way to resolve this is to recompile OpenMPI… correct?


 Regards,

 Uwe


 On 16.04.2015 at 13:18, Ralph Castain wrote:
  No, you shouldn't have to do so - it's a dynamic library that gets
 picked up at execution
 
 
  On Thu, Apr 16, 2015 at 2:55 AM, Bjørn-Helge Mevik 
 b.h.me...@usit.uio.no mailto:b.h.me...@usit.uio.no wrote:
 
 
  We are considering compiling openmpi with --with-pmi=/opt/slurm to
  enable running mpi jobs with srun.
 
  If we do this, will we have to recompile openmpi and/or programs
 built
  with openmpi when we upgrade slurm? (If so, only for major upgrades,
 or
  for minor upgrades as well?)
 
  --
  Regards,
  Bjørn-Helge Mevik, dr. scient,
  Department for Research Computing, University of Oslo
 
 



[slurm-dev] Re: Need for recompiling openmpi built with --with-pmi?

2015-04-16 Thread Ralph Castain
To be clear, we aren't linking to libslurm at all. The issue is that libpmi
is linking to it, and we link to libpmi. So I think you have to recompile
to get the link dependencies set up correctly.
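
One quick way to see which libslurm the PMI library (and hence OMPI's pmi
component) is actually pulling in; the paths below are the ones used earlier in
this thread and may well differ on your system:

  ldd /opt/slurm/lib/libpmi.so | grep libslurm
  ldd /opt/apps/openmpi/1.8.1/gcc/4.9/0/lib/openmpi/mca_ess_pmi.so | grep -E 'libpmi|libslurm'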

On Thu, Apr 16, 2015 at 5:32 AM, Uwe Sauter uwe.sauter...@gmail.com wrote:


 Hi Ralph,

 beside the mentioned libslurm.so.28 there is also a libslurm.so pointing
 to the same libslurm.so.28.0.0 file. Perhaps OpenMPI
 could use this link instead of the versioned one?

 File list in slurm/lib directory:

 -rw-r--r-- 1 slurm slurm   68992 Mar 20 11:39 libpmi.a
 -rwxr-xr-x 1 slurm slurm    1016 Mar 20 11:39 libpmi.la
 lrwxrwxrwx 1 slurm slurm      15 Mar 20 11:39 libpmi.so -> libpmi.so.0.0.0
 lrwxrwxrwx 1 slurm slurm      15 Mar 20 11:39 libpmi.so.0 -> libpmi.so.0.0.0
 -rwxr-xr-x 1 slurm slurm   52800 Mar 20 11:39 libpmi.so.0.0.0
 -rw-r--r-- 1 slurm slurm 8099794 Mar 20 11:39 libslurm.a
 -rw-r--r-- 1 slurm slurm 8348210 Mar 20 11:39 libslurmdb.a
 -rwxr-xr-x 1 slurm slurm    1006 Mar 20 11:39 libslurmdb.la
 lrwxrwxrwx 1 slurm slurm      20 Mar 20 11:39 libslurmdb.so -> libslurmdb.so.28.0.0
 lrwxrwxrwx 1 slurm slurm      20 Mar 20 11:39 libslurmdb.so.28 -> libslurmdb.so.28.0.0
 -rwxr-xr-x 1 slurm slurm 4115144 Mar 20 11:39 libslurmdb.so.28.0.0
 -rwxr-xr-x 1 slurm slurm     992 Mar 20 11:39 libslurm.la
 lrwxrwxrwx 1 slurm slurm      18 Mar 20 11:39 libslurm.so -> libslurm.so.28.0.0
 lrwxrwxrwx 1 slurm slurm      18 Mar 20 11:39 libslurm.so.28 -> libslurm.so.28.0.0
 -rwxr-xr-x 1 slurm slurm 4012214 Mar 20 11:39 libslurm.so.28.0.0
 drwxr-xr-x 2 slurm slurm    4096 Mar 20 11:40 pam
 drwxr-xr-x 3 slurm slurm   12288 Mar 20 11:40 slurm


 Regards,

 Uwe


  On 16.04.2015 at 13:27, Ralph Castain wrote:
  Hmmm...yeah, it sounds like Slurm changed its library names and/or
 dependencies. I'm afraid that you do indeed need to recompile
  OMPI in that case. You probably need to rerun configure as well, just to
 be safe.
 
  Sorry - outside OMPI's control :-/
 
 
  On Thu, Apr 16, 2015 at 5:22 AM, Uwe Sauter uwe.sauter...@gmail.com
 mailto:uwe.sauter...@gmail.com wrote:
 
 
  Hi,
 
  I have the case that OpenMPI was built against Slurm 14.03 (which
 provided libslurm.so.27). Since upgrading to 14.11 I get errors
  like:
 
  [controller:35605] mca: base: component_find: unable to open
  /opt/apps/openmpi/1.8.1/gcc/4.9/0/lib/openmpi/mca_ess_pmi:
  libslurm.so.27: cannot open shared object file: No such file or
 directory (ignored)
 
  because now Slurm provides libslurm.so.28.
 
  I believe the only way to resolve this is to recompile OpenMPI…
 correct?
 
 
  Regards,
 
  Uwe
 
 
   On 16.04.2015 at 13:18, Ralph Castain wrote:
   No, you shouldn't have to do so - it's a dynamic library that gets
 picked up at execution
  
  
   On Thu, Apr 16, 2015 at 2:55 AM, Bjørn-Helge Mevik 
 b.h.me...@usit.uio.no mailto:b.h.me...@usit.uio.no
  mailto:b.h.me...@usit.uio.no mailto:b.h.me...@usit.uio.no
 wrote:
  
  
   We are considering compiling openmpi with
 --with-pmi=/opt/slurm to
   enable running mpi jobs with srun.
  
   If we do this, will we have to recompile openmpi and/or
 programs built
   with openmpi when we upgrade slurm? (If so, only for major
 upgrades, or
   for minor upgrades as well?)
  
   --
   Regards,
   Bjørn-Helge Mevik, dr. scient,
   Department for Research Computing, University of Oslo
  
  
 
 



[slurm-dev] Re: OpenMPI and 14.3 to 14.11 upgrade

2015-02-06 Thread Ralph Castain

Glad to hear you found/fixed the problem!

 On Feb 6, 2015, at 2:44 PM, Peter A Ruprecht peter.rupre...@colorado.edu 
 wrote:
 
 
 Thanks to everyone who responded.
 
 It appears that the issue was not the version of Slurm, but rather that we
 had set TaskAffinity=yes in cgroups.conf at the same time we installed the
 new version.
 
 Applications that were using OpenMPI version 1.6 and prior were in many
 cases showing dramatically slower run times.  I incorrectly wrote earlier
 that v1.8 was also affected; in fact it seems to have been OK.
 
 I don't have a good environment for testing this further at the moment,
 unfortunately, but since we backed out the change the users are happy
 again.
 
 Thanks again,
 Peter
 
 On 2/6/15, 6:49 AM, Ralph Castain r...@open-mpi.org wrote:
 
 
 If you are launching via mpirun, then you won't be using either version
 of PMI - OMPI has its own internal daemons that handle the launch and
 wireup.
 
 It's odd that it happens across OMPI versions as there exist significant
 differences between them. Is the speed difference associated with non-MPI
 jobs as well? In other words, if you execute mpirun hostname, does it
 also take an inordinate amount of time?
 
 If not, then the other possibility is that you are falling back on TCP
 instead of IB, or that something is preventing the use of shared memory
 as a transport for procs on the same node.
 
 
 On Feb 5, 2015, at 5:02 PM, Peter A Ruprecht
 peter.rupre...@colorado.edu wrote:
 
 
 Answering two questions at one time:
 
 I am pretty sure we are not using PMI2.
 
 Jobs are launched via sbatch job_script where the script contains
 mpirun ./executable_file.  There appear to be issues with at least
 OMPI
 1.6.4 and 1.8.X.
 
 Thanks
 Peter
 
 On 2/5/15, 5:39 PM, Ralph Castain r...@open-mpi.org wrote:
 
 
 And are you launching via mpirun or directly with srun myapp? What
 OMPI
 version are you using?
 
 
 On Feb 5, 2015, at 3:32 PM, Chris Samuel sam...@unimelb.edu.au
 wrote:
 
 
 On Thu, 5 Feb 2015 03:27:25 PM Peter A Ruprecht wrote:
 
 I ask because some of our users have started reporting a 10x increase
 in
 run-times of OpenMPI jobs since we upgraded to 14.11.3 from 14.3.
 It's
 possible there is some other problem going on in our cluster, but all
 of
 our hardware checks including Infiniband diagnostics look pretty
 clean.
 
 Are you using PMI2?
 
 cheers,
 Chris
 -- 
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Re: OpenMPI and 14.3 to 14.11 upgrade

2015-02-06 Thread Ralph Castain

If you are launching via mpirun, then you won't be using either version of PMI 
- OMPI has its own internal daemons that handle the launch and wireup.

It's odd that it happens across OMPI versions as there exist significant 
differences between them. Is the speed difference associated with non-MPI jobs 
as well? In other words, if you execute mpirun hostname, does it also take an 
inordinate amount of time?

If not, then the other possibility is that you are falling back on TCP instead 
of IB, or that something is preventing the use of shared memory as a transport 
for procs on the same node.
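
A concrete form of that launch-only test, as a throwaway batch script (node and
task counts are placeholders):

  #!/bin/bash
  #SBATCH -N 2
  #SBATCH --ntasks-per-node=20
  time mpirun hostname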


 On Feb 5, 2015, at 5:02 PM, Peter A Ruprecht peter.rupre...@colorado.edu 
 wrote:
 
 
 Answering two questions at one time:
 
 I am pretty sure we are not using PMI2.
 
 Jobs are launched via sbatch job_script where the script contains
 mpirun ./executable_file.  There appear to be issues with at least OMPI
 1.6.4 and 1.8.X.
 
 Thanks
 Peter
 
 On 2/5/15, 5:39 PM, Ralph Castain r...@open-mpi.org wrote:
 
 
 And are you launching via mpirun or directly with srun myapp? What OMPI
 version are you using?
 
 
 On Feb 5, 2015, at 3:32 PM, Chris Samuel sam...@unimelb.edu.au wrote:
 
 
 On Thu, 5 Feb 2015 03:27:25 PM Peter A Ruprecht wrote:
 
 I ask because some of our users have started reporting a 10x increase
 in
 run-times of OpenMPI jobs since we upgraded to 14.11.3 from 14.3.  It's
 possible there is some other problem going on in our cluster, but all
 of
 our hardware checks including Infiniband diagnostics look pretty clean.
 
 Are you using PMI2?
 
 cheers,
 Chris
 -- 
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Re: OpenMPI and 14.3 to 14.11 upgrade

2015-02-05 Thread Ralph Castain

And are you launching via mpirun or directly with srun myapp? What OMPI 
version are you using?


 On Feb 5, 2015, at 3:32 PM, Chris Samuel sam...@unimelb.edu.au wrote:
 
 
 On Thu, 5 Feb 2015 03:27:25 PM Peter A Ruprecht wrote:
 
 I ask because some of our users have started reporting a 10x increase in
 run-times of OpenMPI jobs since we upgraded to 14.11.3 from 14.3.  It's
 possible there is some other problem going on in our cluster, but all of
 our hardware checks including Infiniband diagnostics look pretty clean.
 
 Are you using PMI2?
 
 cheers,
 Chris
 -- 
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Re: OpenMPI, mpirun and suspend/gang

2014-11-09 Thread Ralph Castain

What stop signal is being sent, and where? We will catch and suspend the job on 
receipt of a SIGTSTP signal by mpirun.
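
One way to poke at that outside of Slurm's preemption path is to deliver the
signal to mpirun by hand (the process name is a placeholder):

  pkill -TSTP -f 'mpirun .*my_mpi_app'
  # on a remote node, the ranks should then show state 'T' in ps:
  ps -o pid,stat,comm -C my_mpi_app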


 On Nov 9, 2014, at 6:47 AM, Jason Bacon jwba...@tds.net wrote:
 
 
 
 Does anyone have SUSPEND,GANG working with openmpi via mpirun?
 
 I've set up a low-priority queue, which seems to be working, except that for 
 openmpi jobs, only the processes on the MPI root node seem to be getting the 
 stop signal.
 
 From slurm.conf:
 
 SelectType=select/cons_res
 SelectTypeParameters=CR_Core_Memory
 PreemptMode=SUSPEND,GANG
 PreemptType=preempt/partition_prio
 
 MpiDefault=none
 
 I've also tried --mca orte_forward_job_control 1, but it had no apparent 
 effect.
 
 Thanks,
 
   Jason


[slurm-dev] Re: Programmatically submit a job

2014-11-07 Thread Ralph Castain
Just be sure you understand that by using slurm.h and linking against libslurm, 
your application will become GPL. IANAL, so you should check with one, if you 
care.



 On Nov 7, 2014, at 11:14 AM, José Román Bilbao Castro jrbc...@idiria.com 
 wrote:
 
 Going even further, I have realized you are following outdated 
 documentation (12 releases behind the current one, which is 14!!).
 
 Here is the current API:
 
 http://slurm.schedmd.com/api.html
 
 Have a look at the Resource allocation section, not the job steps one:
 
 slurm_init_job_desc_msg—Initialize the data structure used in resource 
 allocation requests. You can then just set the fields of particular interest 
 and let the others use default values.
 slurm_job_will_run—Determine if a job would be immediately initiated if 
 submitted now.
 slurm_allocate_resources—Allocate resources for a job. Response message must 
 be freed using slurm_free_resource_allocation_response_msg to avoid a memory 
 leak.
 slurm_free_resource_allocation_response_msg— Frees memory allocated by 
 slurm_allocate_resources.
 slurm_allocate_resources_and_run—Allocate resources for a job and spawn a job 
 step. Response message must be freed 
 using slurm_free_resource_allocation_and_run_response_msg to avoid a memory 
 leak.
 slurm_free_resource_allocation_and_run_response_msg— Frees memory allocated 
 by slurm_allocate_resources_and_run.
 slurm_submit_batch_job—Submit a script for later execution. Response message 
 must be freed using slurm_free_submit_response_response_msg to avoid a memory 
 leak.
 slurm_free_submit_response_response_msg— Frees memory allocated by 
 slurm_submit_batch_job.
 slurm_confirm_allocation—Test if a resource allocation has already been made 
 for a given job id. Response message must be freed 
 using slurm_free_resource_allocation_response_msg to avoid a memory leak. This 
 can be used to confirm that an allocation is still active or for error 
 recovery
 The steps one is to manage the steps concept I suppose. I mean, once you 
 submit a job it follows a series of steps that change its status. I think 
 this is the concept of step here. Steps should not be created nor modified by 
 the user but by the scheduler itself. So it is up to the programmer to submit 
 a job and query steps if needed, but not to modify them...
 
 By the way, I have never programmed using slurm but I think I could be 
 correct ;). If not, I will be delighted to get responses because I will be 
 using job submission API in a few days!
 
 Bests
 
 
 
 
 Sent from my iPad
 
 On 7/11/2014, at 20:02, José Román Bilbao Castro jrbc...@idiria.com wrote:
 
 
 I have read a bit further. I think this can be a misunderstanding of
 documentation. It does not say you cannot submit jobs but that you
 should use this API to create and populate new jobs information. There
 is a structure that defines a job but it shouldn't be manipulated
 directly but through the API. Once the job is defined, you can submit
 it using a different API:
 
 http://slurm.schedmd.com/launch_plugins.html
 
 Have a look at the last sentence on the first section. It states to
 have a look at the src/plugins/launch/slurm/launch_slurm.c  file.
 
 Also, for a broader picture, have a look at this page on the Developers 
 section:
 
 http://slurm.schedmd.com/documentation.html
 
 Especially the Design subsection, where the process of creating and
 submitting a job is further described (it consists of multiple steps
 and APIs).
 
 Hope this helps
 
 Sent from my iPad
 
 On 7/11/2014, at 16:08, Всеволод Никоноров vs.nikono...@yandex.ru wrote:
 
 
 Hello Walter,
 
 Maybe you could just use the system() function?
 
 #include <cstdlib>
 #include <cstdio>
 int main (int argc, char** argv)
 {
   printf ("begin\n");
   system ("sleep 1");
   printf ("end\n");
 }
 
 Place your sbatch call in place of my sleep 1; shouldn't this do what you want?
 
 06.11.2014, 09:49, Walter Landry wlan...@caltech.edu 
 mailto:wlan...@caltech.edu:
 Hello Everyone,
 
 What is the recommended way for a C++ program to submit a job? The
 API documentation says
 
   "SLURM job steps involve numerous interactions with the slurmd
   daemon. The job step creation is only the first step in the
   process. We don't advise direct user creation of job steps, but
   include the information here for completeness."
 
 Should I use system("srun")? DRMAA?
 
 Thank you,
 Walter Landry
 wlan...@caltech.edu mailto:wlan...@caltech.edu
 
 -- 
 Vsevolod Nikonorov, JSC NIKIET
 



[slurm-dev] Re: Programmatically submit a job

2014-11-07 Thread Ralph Castain
Just a general caution to the thread as people were discussing how to use a 
programmatic API to launch a job. I don’t know of any 3rd party software that 
directly links against slurm - we all use fork/exec or system calls to srun to 
avoid the licensing issue.
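
A minimal sketch of that approach in C (the script name is a placeholder):

  #include <stdio.h>
  #include <sys/types.h>
  #include <sys/wait.h>
  #include <unistd.h>

  /* Submit a batch script by fork/exec'ing sbatch instead of linking libslurm. */
  int main(void)
  {
      pid_t pid = fork();
      if (pid < 0) { perror("fork"); return 1; }
      if (pid == 0) {
          execlp("sbatch", "sbatch", "job.sh", (char *)NULL);
          perror("execlp sbatch");   /* reached only if sbatch could not be started */
          _exit(127);
      }
      int status = 0;
      waitpid(pid, &status, 0);
      return WIFEXITED(status) ? WEXITSTATUS(status) : 1;
  }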


 On Nov 7, 2014, at 1:47 PM, José Román Bilbao Castro jrbc...@idiria.com 
 wrote:
 
 Hi Ralph,
 
 Is this one for me, the original poster or for both of us?. Anyway, just out 
 of curiosity (as I am not familiar with licensing issues), how do third party 
 software that uses SLURM does to make you pay so much money in licenses while 
 not breaking GPL rules?, do they use system calls?. I never liked systems 
 calls... For example EngineFrame...
 
 Thanks in advance
 
 Sent from my iPad
 
 On 7/11/2014, at 22:02, Ralph Castain r...@open-mpi.org wrote:
 
 Just be sure you understand that by using slurm.h and linking against 
 libslurm, your application will become GPL. IANAL, so you should check with 
 one, if you care.
 
 
 
 On Nov 7, 2014, at 11:14 AM, José Román Bilbao Castro jrbc...@idiria.com 
 mailto:jrbc...@idiria.com wrote:
 
 Going even further, I have realized you are following outdated 
 documentation (12 releases behind the current one, which is 14!!).
 
 Here is the current API:
 
 http://slurm.schedmd.com/api.html
 
 Have a look at the Resource allocation section, not the job steps one:
 
 slurm_init_job_desc_msg—Initialize the data structure used in resource 
 allocation requests. You can then just set the fields of particular 
 interest and let the others use default values.
 slurm_job_will_run—Determine if a job would be immediately initiated if 
 submitted now.
 slurm_allocate_resources—Allocate resources for a job. Response message 
 must be freed using slurm_free_resource_allocation_response_msg to avoid a 
 memory leak.
 slurm_free_resource_allocation_response_msg— Frees memory allocated by 
 slurm_allocate_resources.
 slurm_allocate_resources_and_run—Allocate resources for a job and spawn a 
 job step. Response message must be freed 
 using slurm_free_resource_allocation_and_run_response_msg to avoid a memory 
 leak.
 slurm_free_resource_allocation_and_run_response_msg— Frees memory allocated 
 by slurm_allocate_resources_and_run.
 slurm_submit_batch_job—Submit a script for later execution. Response 
 message must be freed using slurm_free_submit_response_response_msg to 
 avoid a memory leak.
 slurm_free_submit_response_response_msg— Frees memory allocated by 
 slurm_submit_batch_job.
 slurm_confirm_allocation—Test if a resource allocation has already been 
 made for a given job id. Response message must be freed 
 using slurm_free_resource_allocation_response_msg to avoid a memory leak. 
 This can be used to confirm that an allocation is still active or for error 
 recovery
 The steps one is to manage the steps concept I suppose. I mean, once you 
 submit a job it follows a series of steps that change its status. I think 
 this is the concept of step here. Steps should not be created nor modified 
 by the user but by the scheduler itself. So it is up to the programmer to 
 submit a job and query steps if needed, but not to modify them...
 
 By the way, I have never programmed using slurm but I think I could be 
 correct ;). If not, I will be delighted to get responses because I will be 
 using job submission API in a few days!
 
 Bests
 
 
 
 
 Sent from my iPad
 
 On 7/11/2014, at 20:02, José Román Bilbao Castro jrbc...@idiria.com wrote:
 
 
 I have read a bit further. I think this can be a misunderstanding of
 documentation. It does not say you cannot submit jobs but that you
 should use this API to create and populate new jobs information. There
 is a structure that defines a job but it shouldn't be manipulated
 directly but through the API. Once the job is defined, you can submit
 it using a different API:
 
 http://slurm.schedmd.com/launch_plugins.html
 
 Have a look at the last sentence on the first section. It states to
 have a look at the src/plugins/launch/slurm/launch_slurm.c  file.
 
 Also, for a broader picture, have a look at this page on the Developers 
 section:
 
 http://slurm.schedmd.com/documentation.html
 
 Especially the Design subsection, where the process of creating and
 submitting a job is further described (it consists of multiple steps
 and APIs).
 
 Hope this helps
 
 Sent from my iPad
 
 On 7/11/2014, at 16:08, Всеволод Никоноров vs.nikono...@yandex.ru wrote:
 
 
 Hello Walter,
 
 Maybe you could just use the system() function?
 
 #include <cstdlib>
 #include <cstdio>
 int main (int argc, char** argv)
 {
   printf ("begin\n");
   system ("sleep 1");
   printf ("end\n");
 }
 
 Place your sbatch call in place of my sleep 1; shouldn't this do what you 
 want?
 
 06.11.2014, 09

[slurm-dev] Re: Failure tolerance in slurm + openmpi

2014-10-29 Thread Ralph Castain
FWIW: that isn't a Slurm issue. OMPI's mpirun will kill the job if any MPI 
process abnormally terminates unless told otherwise. See the mpirun man page 
for options on how to run without termination.


 On Oct 29, 2014, at 12:34 AM, Artem Polyakov artpo...@gmail.com wrote:
 
 Hello, Steven.
 
 As one option, you could give the DMTCP project 
 (http://dmtcp.sourceforge.net/) a try. I was 
 the one who added SLURM support there, and it is relatively stable now (still 
 under development though). Let me know if you have any problems.
 
 2014-10-29 13:10 GMT+06:00 Steven Chow wulingaoshou_...@163.com 
 mailto:wulingaoshou_...@163.com:
 Hi,
 I am new to slurm. 
 I have a problem with failure tolerance when running an MPI 
 application on a cluster with slurm. 
 
 My slurm version is 14.03.6, and the MPI version is OPEN MPI  1.6.5.
 I didn't use plugin Checkpoint or Nonstop.
 
 I submit the job through command salloc -N 10 --no-kill  mpirun 
 ./my-mpi-application.
 
 While the job is running, if one node crashes, then the WHOLE job is 
 killed on all allocated nodes.
 It seems that the --no-kill option doesn't work.
 
  I want the job to continue running without being killed, even if some nodes 
 fail or network connections break, 
 because I will handle node failures myself.
 
 Can anyone give some suggestions.
 
 Besides, if I want to use the Nonstop plugin to handle failure, according to 
 http://slurm.schedmd.com/nonstop.html, an additional package named smd 
 will also need to be installed. 
 How can I get this package?
 
 Thanks!
 
 -Steven Chow
 
 
  
 
 
 
 -- 
 С Уважением, Поляков Артем Юрьевич
 Best regards, Artem Y. Polyakov
 



[slurm-dev] Re: Failure tolerance in slurm + openmpi

2014-10-29 Thread Ralph Castain

 On Oct 29, 2014, at 3:07 AM, Yann Sagon ysa...@gmail.com wrote:
 
 
 
 2014-10-29 8:11 GMT+01:00 Steven Chow wulingaoshou_...@163.com 
 mailto:wulingaoshou_...@163.com:
 
 
 I submit the job through command salloc -N 10 --no-kill  mpirun 
 ./my-mpi-application.
 
 Hello, you are not supposed to use mpirun with slurm but directly srun (or 
 something similar). 
 

That simply isn’t true - there is no problem using mpirun with Slurm, and many 
people do so because they want the options offered by mpirun.




[slurm-dev] Re: Failure tolerance in slurm + openmpi

2014-10-29 Thread Ralph Castain
Hmm…perhaps I misunderstood. I thought your application was an MPI application, 
in which case OMPI would definitely abort the job if something fails. So I’m 
puzzled by your observation as (a) I wrote the logic that responds to that 
problem, and (b) I verified that it does indeed behave as designed.

Are you setting an MCA param to tell it not to terminate upon bad proc exit?


 On Oct 29, 2014, at 3:55 AM, Steven Chow wulingaoshou_...@163.com wrote:
 
 
 If  I run my mpi_application without using slurm, for example, I use mpirun 
 --machinefile host_list --pernode  ./my_mpi_application,
 then in the running process, a node crashes, and it turns out that the 
 application still works.
 
 If I used
 salloc -N 10 --no-kill  mpirun ./my-mpi-application
 
 
 when the whole job being killed, I got the slurmctld.logs as following:
slurmctld.log:
[2014-10-29T09:20:58.150] sched: _slurm_rpc_allocate_resources JobId=6 
 NodeList=vsc-00-03-[00-01],vsc-server-204 usec=8221
[2014-10-29T09:21:08.103] sched: _slurm_rpc_job_step_create: 
 StepId=6.0 vsc-00-03-[00-01] usec=5004
   [2014-10-29T09:23:05.066] sched: Cancel of StepId=6.0 by UID=0 
 usec=5054
   [2014-10-29T09:23:05.084] sched: Cancel of StepId=6.0 by UID=0 
 usec=3904
   [2014-10-29T09:23:07.005] sched: Cancel of StepId=6.0 by UID=0 
 usec=5114 
 And ,slurmd.log as following:
   [2014-10-29T09:21:13.000] launch task 6.0 request from 
 0.0@10.0.3.204 mailto:0.0@10.0.3.204 (port 60118)
   [2014-10-29T09:21:13.054] Received cpu frequency information for 2 
 cpus
   [2014-10-29T09:23:09.958] [6.0] *** STEP 6.0 KILLED AT 
 2014-10-29T09:23:09 WITH SIGNAL 9 ***
   [2014-10-29T09:23:09.976] [6.0] *** STEP 6.0 KILLED AT 
 2014-10-29T09:23:09 WITH SIGNAL 9 ***
   [2014-10-29T09:23:11.897] [6.0] *** STEP 6.0 KILLED AT 
 2014-10-29T09:23:11 WITH SIGNAL 9 ***
 
 
 So I get the conclusion that it is slurm who kills the mpi job.
 
 
 
 
 At 2014-10-29 17:40:43, Ralph Castain r...@open-mpi.org 
 mailto:r...@open-mpi.org wrote:
 FWIW: that isn't a Slurm issue. OMPI's mpirun will kill the job if any MPI 
 process abnormally terminates unless told otherwise. See the mpirun man page 
 for options on how to run without termination.
 
 
 On Oct 29, 2014, at 12:34 AM, Artem Polyakov artpo...@gmail.com 
 mailto:artpo...@gmail.com wrote:
 
 Hello, Steven.
 
 As one option, you could give the DMTCP project 
 (http://dmtcp.sourceforge.net/) a try. I was 
 the one who added SLURM support there, and it is relatively stable now (still 
 under development though). Let me know if you have any problems.
 
 2014-10-29 13:10 GMT+06:00 Steven Chow wulingaoshou_...@163.com 
 mailto:wulingaoshou_...@163.com:
 Hi,
 I am new to slurm. 
 I have a problem with failure tolerance when running an MPI 
 application on a cluster with slurm.
 
 My slurm version is 14.03.6, and the MPI version is OPEN MPI  1.6.5.
 I didn't use plugin Checkpoint or Nonstop.
 
 I submit the job through command salloc -N 10 --no-kill  mpirun 
 ./my-mpi-application.
 
 While the job is running, if one node crashes, then the WHOLE job is 
 killed on all allocated nodes.
 It seems that the --no-kill option doesn't work.
 
  I want the job to continue running without being killed, even if some 
 nodes fail or network connections break, 
 because I will handle node failures myself.
 
 Can anyone give some suggestions.
 
 Besides, if I want to use the Nonstop plugin to handle failure, according to 
 http://slurm.schedmd.com/nonstop.html, an additional package named smd 
 will also need to be installed. 
 How can I get this package?
 
 Thanks!
 
 -Steven Chow
 
 
 
 
 
 
 -- 
 С Уважением, Поляков Артем Юрьевич
 Best regards, Artem Y. Polyakov
 
 
 



[slurm-dev] Re: pmi and hwloc

2014-07-13 Thread Ralph Castain
Just to clarify something: this only occurs when --with-pmi is provided. We 
*never* link directly against libslurm for licensing reasons, and --with-slurm 
doesn't cause us to link against any Slurm libraries.

So the only impact here is that we would have to drop support for directly 
launching apps using srun, and require the use of mpirun instead. Regrettable, 
but my point is to clarify that this doesn't preclude use of OMPI under Slurm 
environments.

Obviously, we would prefer to see it resolved, and that libpmi stand alone as 
an LGPL library :-)  This goes beyond what Mike is requesting, which is to at 
least remove the hwloc dependency as PMI clearly doesn't require it.
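
A quick way to confirm the dependency Mike is describing (assuming Slurm's
default library location):

  grep dependency_libs /opt/slurm/lib/libpmi.la /opt/slurm/lib/libslurm.la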


On Jul 13, 2014, at 4:24 AM, Mike Dubman mi...@mellanox.com wrote:

  
 Hi guys,
  
 The new SLURM 14.x series contains a “-lhwloc” dependency mentioned in the 
 dependency_libs= string, in the slurm provided .la files:
  
 libpmi.la
 libslurmdb.la
 libslurm.la
  
 This breaks OMPI compilation when either –with-pmi or –with-slurm flags 
 provided to OMPI “configure”.
  
 I checked previous SLURM 2.6.x version and it does not have such dependency 
 for hwloc.
  
 http://www.open-mpi.org/community/lists/devel/2014/07/15130.php
  
 Please fix.
 Thanks
  
  
 Kind Regards,
  
 Mike Dubman | R&D Senior Director, HPC
 Tel:  +972 (74) 712 9214 | Fax: +972 (74) 712 9111
 Mellanox Ltd. 13 Zarchin St., Bldg B, Raanana 43662, Israel



[slurm-dev] Re: SLURM_JOB_NAME not set for job

2014-04-14 Thread Ralph Castain


On Apr 14, 2014, at 9:40 AM, Michael Jennings m...@lbl.gov wrote:

 
 On Mon, Apr 14, 2014 at 9:18 AM, E V eliven...@gmail.com wrote:
 
 I noticed you excluded the configure script and other stuff needed to build 
 the package. I guess I'll add it back in on my fork. If you don't want it in 
 your repo I'll just put it on a build branch.
 
 configure and other auto-generated files should *never* be in SCM
 trees such as GitHub.  Use the autogen.sh script to generate them.
 configure.ac is the source file, and it's in the original
 slurm-drmaa tree.
 

While I agree with the sentiment...have you looked at the Slurm repo?


 Michael
 
 -- 
 Michael Jennings m...@lbl.gov
 Senior HPC Systems Engineer
 High-Performance Computing Services
 Lawrence Berkeley National Laboratory
 Bldg 50B-3209EW: 510-495-2687
 MS 050B-3209  F: 510-486-8615


[slurm-dev] Re: backfill scheduler look ahead?

2014-02-25 Thread Ralph Castain

On Feb 25, 2014, at 7:34 AM, Moe Jette je...@schedmd.com wrote:

 
 Quoting Yuri D'Elia wav...@thregr.org:
 
 
 On 02/20/2014 07:21 PM, Moe Jette wrote:
 
 Slurm uses what is known as a conservative backfill scheduling
 algorithm. No job will be started that adversely impacts the expected
 start time of _any_ higher priority job. The scheduling can also be
 effected by a job's requirements for memory, generic resources,
 licenses, and resource limits.
 
 I'm curious whether this could be changed with a setting to disregard
 the expected start time of higher priority jobs.
 
 Given that giving/estimating completion times of jobs is akin to sorcery
 in many cases, it would be beneficial in my case to always
 under-estimate the time limit.
 
 I'm wondering if anybody is running with a overly-conservative TimeLimit
 for jobs, and abusing OverTimeLimit [very high value] to achieve this.
 
 I know I would definitely use a EstimatedTimeLimit parameter for
 improved backfilling and give an absolute ceiling with TimeLimit (if I
 could).
 
 I haven't had time to work on this, but one idea would be estimate a job's 
 run time based upon historic data and use that as a basis for backfill 
 scheduling. I suspect the results would be better responsiveness and higher 
 utilization than when basing scheduling decisions upon the user's time limit.

FWIW: that has worked very poorly in the past. The problem is that the workload 
depends heavily upon the data set, and so past performance is a very poor 
indicator of future behavior except in rare circumstances (e.g., a nightly 
weather forecast where the data is consistent night after night).



 Moe Jette
 SchedMD


[slurm-dev] Re: Slurm daemon on Xeon Phi cards?

2014-02-17 Thread Ralph Castain

I know others have direct-launched processes onto the Phi before, both with 
Slurm and just using rsh/ssh. The OpenMPI user mailing list archive talks about 
the ssh method (search for phi and you'll see the chatter)

http://www.open-mpi.org/community/lists/users/

and the folks at Bright talk about how they did it with Slurm here:

https://www.linkedin.com/groups/Yes-we-run-SLURM-inside-4501392.S.5792769036550955008

Ralph

On Feb 17, 2014, at 5:46 PM, Christopher Samuel sam...@unimelb.edu.au wrote:

 
 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA1
 
 Hi all,
 
 At the Slurm User Group in Oakland last year it was mentioned that
 there was intended to be support for a lightweight Slurm daemon on
 Xeon Phi (MIC) cards.
 
 I had a quick look in the git master last night but couldn't spot
 anything related, is this still the intention?
 
 Olli-Pekka Lehto from CSC is running a Xeon Phi workshop at VLSCI at
 the moment and it's of interest to a number of us.
 
 We're going to run a hack day on Wednesday and we'll see if we can
 build an LDAP enabled Xeon Phi stack, if we can then we we'll see if
 we can get standard Slurm going too. Nothing like having lofty goals!
 
 All the best!
 Chris
 - -- 
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci
 
 -BEGIN PGP SIGNATURE-
 Version: GnuPG v1.4.14 (GNU/Linux)
 Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/
 
 iEYEARECAAYFAlMCuvIACgkQO2KABBYQAh+pwgCcCLPvoUJamArfmpxY5igcJm3I
 0p0AnjF51qUgZfoZtIsKTDLCK+pJe+bf
 =7HO3
 -END PGP SIGNATURE-


[slurm-dev] Re: Bug in pmi2_api.c

2013-10-03 Thread Ralph Castain


On Oct 3, 2013, at 10:00 AM, Michael Jennings m...@lbl.gov wrote:

 On Thu, Oct 3, 2013 at 9:44 AM, David Bigagli da...@schedmd.com wrote:
 
   I don't know the details of the segfault but the code in question is
 correct. If you decrease the length, then the field cmdlen:
 
 cmdlen = PMII_MAX_COMMAND_LEN - remaining_len;
 
 will not be correct and the wrong length will be sent to the pmi2 server.
 
 This code is taken verbatim from mpich2-1.5, precisely from:
 mpich2-1.5/src/pmi/pmi2/simple2pmi.c
 
 If that's true, the original code is also wrong. :-)

Indeed - bottom line is that just because it is in mpich doesn't mean it is 
correct :-)

 
 Ralph is correct here.  Let's walk through this:
 
char cmdbuf[PMII_MAX_COMMAND_LEN];
char *c = cmdbuf;
int remaining_len = PMII_MAX_COMMAND_LEN;
 
/* leave space for length field */
memset(c, ' ', PMII_COMMANDLEN_SIZE);
c += PMII_COMMANDLEN_SIZE;
 
 Now let's say PMII_MAX_COMMAND_LEN is 4 and PMII_COMMANDLEN_SIZE is 1.
 cmdbuf and c start out pointing to a 4-byte buffer.  We then set the
 first char to ' ' and point c to the 2nd position in the buffer.
 remaining_len is still 4.
 
 Now, later we have:
 
ret = snprintf(c, remaining_len, "thrid=%p;", resp);
 
 This will segfault because it thinks it can write 4 chars to c when,
 in fact, it can only write 3.  Obviously these small buffer sizes are
 contrived, but they illustrate the point.  :-)
 
 If the resulting command length value depends on this bug being
 present, some mitigation will have to happen later on - either changes
 in every snprintf() call, or a last-minute alteration to the command
 length (probably the wiser idea).
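 
 For anyone who wants to reproduce the effect outside of Slurm, here is a
 standalone sketch with the constants shrunk so the arithmetic is visible
 (the names are stand-ins for PMII_MAX_COMMAND_LEN / PMII_COMMANDLEN_SIZE):
 
   #include <stdio.h>
   #include <string.h>
 
   #define MAX_LEN  8
   #define LEN_SIZE 6
 
   int main(void)
   {
       char cmdbuf[MAX_LEN];
       char *c = cmdbuf;
       int remaining_len = MAX_LEN;
 
       memset(c, ' ', LEN_SIZE);   /* reserve the length field */
       c += LEN_SIZE;              /* only MAX_LEN - LEN_SIZE bytes remain at c */
 
       /* BUG: remaining_len still claims MAX_LEN bytes, so snprintf may write
          past the end of cmdbuf (immediately so with a libc that pre-fills the
          destination), which is the overrun discussed above. */
       snprintf(c, remaining_len, "thrid=%p;", (void *)c);
 
       /* Fix discussed here: remaining_len -= LEN_SIZE; before formatting. */
       return 0;
   }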

Thanks - that matches our analysis too

 
 Michael
 
 -- 
 Michael Jennings m...@lbl.gov
 Senior HPC Systems Engineer
 High-Performance Computing Services
 Lawrence Berkeley National Laboratory
 Bldg 50B-3209EW: 510-495-2687
 MS 050B-3209  F: 510-486-8615


[slurm-dev] Re: Bug in pmi2_api.c

2013-10-03 Thread Ralph Castain

Cool - thanks David!

On Oct 3, 2013, at 11:31 AM, David Bigagli da...@schedmd.com wrote:

 Fixed. I chose the method proposed by Michael, subtract first, add later. :-)
 
 On 10/03/2013 10:29 AM, Ralph Castain wrote:
 
 
 On Oct 3, 2013, at 10:16 AM, David Bigagli da...@schedmd.com wrote:
 
 
 
 I am not saying that remaining_len is correct or that mpich is bugless :-) 
 I am only saying that decrementing remaining_len as proposed breaks the 
 pmi2 protocol as since the client sends a wrong length to the server. We 
 actually applied the very same fix in the past and then had to undo it 
 commit: 1e25eb10df0f.
 
 Funny enough Ralph was one of those that reported that broken protocol bug 
 :-) mpi/pmi2: no value for key %s in req.
 
 Ha! Yes indeed, we do see that reappear. :-/
 
 
 We can either use sprintf() as there is no need to use snprintf() here or 
 do snprintf(c, remaining_len - PMII_COMMANDLEN_SIZE, "thrid=%p;", resp);
 
 Either way is fine with me - feel free to choose.
 Thanks!
 
 
 On 10/03/2013 09:50 AM, Ralph Castain wrote:
 
 I'm afraid that isn't quite correct, David. If you walk thru the code, 
 you'll see that the snprintf is being given an incorrect length for the 
 buffer. When using some libc versions, snprintf automatically fills the 
 given buffer with NULLs to ensure that the string is NULL-terminated. When 
 it does so, the code as currently written causes a segfault.
 
 We have tested the provided change and it fixes the problem. If your code 
 review indicates that remaining_len should not be changed where we did it, 
 then you need to at least change the snprintf command to reflect the 
 reduced size of the c buffer.
 
 On Oct 3, 2013, at 9:44 AM, David Bigagli da...@schedmd.com wrote:
 
 
 
 Hi,
   I don't know the details of the segfault but the code in question is 
 correct. If you decrease the length, then the field cmdlen:
 
 cmdlen = PMII_MAX_COMMAND_LEN - remaining_len;
 
 will not be correct and the wrong length will be sent to the pmi2 server.
 
 This code is taken verbatim from mpich2-1.5, precisely from: 
 mpich2-1.5/src/pmi/pmi2/simple2pmi.c
 
 On 10/03/2013 08:47 AM, Ralph Castain wrote:
 
 Hi folks
 
 We have uncovered a segfault-causing bug in contribs/pmi2/pmi2_api.c. 
 Looking at the code in the current trunk, you need to add the following 
 at line 1499:
 
 remaining_len -= PMII_COMMANDLEN_SIZE;
 
 as you have changed the location of the pointer to the buffer, but 
 failed to account for that change in the size param being later passed 
 to snprintf.
 
 Could you please fix this and include it in 2.6.3?
 
 Thanks
 Ralph
 
 
 --
 
 Thanks,
  /David
 
 --
 
 Thanks,
  /David
 
 -- 
 
 Thanks,
  /David


[slurm-dev] Re: slurm openmpi number of cores per task

2013-09-13 Thread Ralph Castain
Configure OMPI --with-pmi, as the port reservation method won't work in this 
scenario.
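
A rough sketch of that rebuild and the resulting direct launch (paths are
placeholders):

  ./configure --prefix=/opt/openmpi --with-slurm --with-pmi=/opt/slurm
  make install
  # after rebuilding, the original invocation should no longer need reserved ports:
  srun -n64 -c4 ./wrapper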

On Sep 13, 2013, at 8:29 AM, Yann Sagon ysa...@gmail.com wrote:

 I have that set in slurm.conf:
 
 MpiDefault=openmpi
 MpiParams=ports=12000-12999
 
 The cluster is 56 nodes of 16 cores. Do I need to increase something?
 
 If I issue this on my nodes, nothings appears:
 
 cexec netstat -laputen | grep ':12[0-9]\{3\}'
 
 
 
 
 2013/9/13 Moe Jette je...@schedmd.com
 
 I suspect the problem is related to reserved ports described here:
 http://slurm.schedmd.com/mpi_guide.html#open_mpi
 
 
 Quoting Yann Sagon ysa...@gmail.com:
 
 (sorry for the previous post, bad manipulation)
 
 Hello,
 
 I'm facing the following problem: one of our users wrote a simple C wrapper
 that launches a multithreaded program. It was working before an update of
 the cluster (os, and ofed).
 
 the wrapper is invoked like that:
 
 $srun -n64 -c4 wrapper
 
 The result is something like that:
 
 [...]
 srun: error: node04: task 12: Killed
 srun: error: node04: tasks 13-15 unable to claim reserved port, retrying.
 srun: Terminating job step 47498.0
 slurmd[node04]: *** STEP 47498.0 KILLED AT 2013-09-13T17:13:33 WITH SIGNAL
 9 ***
 [...]
 
 If we call the wrapper like that:
 
 $srun -n64 wrapper
 
 it is working but we have only one core per thread.
 
 We were using slurm 2.5.4, now I tried with 2.6.2
 Tested with openmpi 1.6.4 and 1.6.5
 
 
 here is the code of the wrapper:
 
 #include <stdio.h>
 #include <stdlib.h>
 #include <mpi.h>
 
 int main(int argc, char *argv[])
 {
     int rank, size;
     char buf[512];
 
     MPI_Init(&argc, &argv);
     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
     MPI_Comm_size(MPI_COMM_WORLD, &size);
     sprintf(buf, "the_multithreaded_binary %d %d", rank, size);
     system(buf);
     MPI_Finalize();
 
     return 0;
 }
 
 
 
 
 



[slurm-dev] Re: slurm openmpi number of cores per task

2013-09-13 Thread Ralph Castain

Did you configure OMPI --with-pmi? If not, then this won't work.


On Sep 13, 2013, at 8:21 AM, Moe Jette je...@schedmd.com wrote:

 
 I suspect the problem is related to reserved ports described here:
 http://slurm.schedmd.com/mpi_guide.html#open_mpi
 
 Quoting Yann Sagon ysa...@gmail.com:
 
 (sorry for the previous post, bad manipulation)
 
 Hello,
 
 I'm facing the following problem: one of our users wrote a simple C wrapper
 that launches a multithreaded program. It was working before an update of
 the cluster (os, and ofed).
 
 the wrapper is invoked like that:
 
 $srun -n64 -c4 wrapper
 
 The result is something like that:
 
 [...]
 srun: error: node04: task 12: Killed
 srun: error: node04: tasks 13-15 unable to claim reserved port, retrying.
 srun: Terminating job step 47498.0
 slurmd[node04]: *** STEP 47498.0 KILLED AT 2013-09-13T17:13:33 WITH SIGNAL
 9 ***
 [...]
 
 If we call the wrapper like that:
 
 $srun -n64 wrapper
 
 it is working but we have only one core per thread.
 
 We were using slurm 2.5.4, now I tried with 2.6.2
 Tested with openmpi 1.6.4 and 1.6.5
 
 
 here is the code of the wrapper:
 
 #include <stdio.h>
 #include <stdlib.h>
 #include <mpi.h>
 
 int main(int argc, char *argv[])
 {
     int rank, size;
     char buf[512];
 
     MPI_Init(&argc, &argv);
     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
     MPI_Comm_size(MPI_COMM_WORLD, &size);
     sprintf(buf, "the_multithreaded_binary %d %d", rank, size);
     system(buf);
     MPI_Finalize();
 
     return 0;
 }
 
 
 


[slurm-dev] Re: fairshare incrementing

2013-08-23 Thread Ralph Castain

Perhaps it is a copy/paste error - but those two tables are identical

On Aug 23, 2013, at 12:14 PM, Alan V. Cowles alan.cow...@duke.edu wrote:

 
 Final update for the day: we have found what is causing priority to be 
 overlooked; we just don't know why it is happening...
 
 [root@cluster-login ~]# squeue  --format=%a %.7i %.9P %.8j %.8u %.8T %.10M 
 %.9l %.6D %R |grep user1
 (null)  181378lowmem testbatc user1  PENDING   0:00 UNLIMITED1 
 (Priority)
 (null)  181379lowmem testbatc user1  PENDING   0:00 UNLIMITED1 
 (Priority)
 (null)  181380lowmem testbatc user1  PENDING   0:00 UNLIMITED1 
 (Priority)
 (null)  181381lowmem testbatc user1  PENDING   0:00 UNLIMITED1 
 (Priority)
 (null)  181382lowmem testbatc user1  PENDING   0:00 UNLIMITED1 
 (Priority)
 (null)  181383lowmem testbatc user1  PENDING   0:00 UNLIMITED1 
 (Priority)
 (null)  181384lowmem testbatc user1  PENDING   0:00 UNLIMITED1 
 (Priority)
 (null)  181385lowmem testbatc user1  PENDING   0:00 UNLIMITED1 
 (Priority)
 (null)  181386lowmem testbatc user1  PENDING   0:00 UNLIMITED1 
 (Priority)
 (null)  181387lowmem testbatc user1  PENDING   0:00 UNLIMITED1 
 (Priority)
 
 Compared to:
 
 [root@cluster-login ~]# squeue  --format=%a %.7i %.9P %.8j %.8u %.8T %.10M 
 %.9l %.6D %R |grep user2
 account  181378lowmem testbatc user2  PENDING   0:00 UNLIMITED1 
 (Priority)
 account  181379lowmem testbatc user2  PENDING   0:00 UNLIMITED1 
 (Priority)
 account  181380lowmem testbatc user2  PENDING   0:00 UNLIMITED1 
 (Priority)
 account  181381lowmem testbatc user2  PENDING   0:00 UNLIMITED1 
 (Priority)
 account  181382lowmem testbatc user2  PENDING   0:00 UNLIMITED1 
 (Priority)
 account  181383lowmem testbatc user2  PENDING   0:00 UNLIMITED1 
 (Priority)
 account  181384lowmem testbatc user2  PENDING   0:00 UNLIMITED1 
 (Priority)
 account  181385lowmem testbatc user2  PENDING   0:00 UNLIMITED1 
 (Priority)
 account  181386lowmem testbatc user2  PENDING   0:00 UNLIMITED1 
 (Priority)
 account  181387lowmem testbatc user2  PENDING   0:00 UNLIMITED1 
 (Priority)
 
 
 We have tried to create new users and new accounts this afternoon and all of 
 them show (null) as their account when we break out the formatting rules on 
 sacct.
 
 sacctmgr add account accountname
 sacctmgr add user username defaultaccount accountname
 
 We even have one case where all users under an account are working fine 
 except a user we added yesterday... so at some point in the past (logs aren't 
 helping us thus far) the ability to actually sync up a user and an account 
 for accounting purposes has left us. Also I have failed to mention to this 
 point that we are still running Slurm 2.5.4, my apologies for that.
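 
 One cross-check worth capturing here, to see what association (if any) the
 affected user actually has (the user name is a placeholder):
 
   sacctmgr show associations user=someuser format=Cluster,Account,User,Fairshare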
 
 AC
 
 
 On 08/23/2013 11:22 AM, Alan V. Cowles wrote:
 Sorry to spam the list, but we wanted to keep updates in flux.
 
 We managed to find the issue in our mysqldb we are using for job accounting 
 which had the column value set to smallint (5) for that value, so it was 
 rounding things off, some SQL magic and we now have appropriate uid's 
 showing up. A new monkey wrench, some test jobs submitted by user3 below get 
 their fairshare value of 5000 as expected, just not user2... we just cleared 
 his jobs from the queue, and submitted another 100 jobs for testing and none 
 of them got a fairshare value...
 
 In his entire history of using our cluster he hasn't submitted over 5000 
 jobs, in fact:
 
 [root@slurm-master ~]# sacct -c 
 --format=user,jobid,jobname,start,elapsed,state,exitcode -u user2 | grep 
 user2 | wc -l
 2573
 
 So we can't figure out why he's being overlooked.
 
 AC
 
 
 On 08/23/2013 10:31 AM, Alan V. Cowles wrote:
 We think we may be onto something, in sacct we were looking at the jobs 
 submitted by the users, and found that many users share the same uidnumber 
 in the slurm database. It seems to correlate with the size of the user's 
 uid number in our ldap directory... users who's uid number are greater than 
 65535 get trunked to that number... users with uid numbers below that keep 
 their correct uidnumbers (user2 in the sample output below)
 
 
 
 
 [root@slurm-master ~]# sacct -c 
 --format=User,uid,JobID,JobName,NodeList,Start,Elapsed,ExitCode,DerivedExitCode,state
  |grep user2|head
 user2  27545 30548   bwa node01-1 2013-07-08T13:04:25   
 00:00:48  0:0 COMPLETED
 user2  27545 30571   bwa node01-1 2013-07-08T15:18:00   
 00:00:48  0:0 COMPLETED
 user2  27545 30573   bwa node01-1 2013-07-09T09:40:59   
 00:00:48  0:0 COMPLETED
 user2  27545 30618  grep node01-1 2013-07-09T11:57:12   
 00:00:48  0:0 COMPLETED
 user2  27545 30619bc node01-1 2013-07-09T11:58:08   
 00:00:48  0:0 CANCELLED
 user2  27545 30620   

[slurm-dev] Re: fairshare incrementing

2013-08-23 Thread Ralph Castain

Ah, never mind - I see the difference now. Was looking for some info to be 
different


On Aug 23, 2013, at 12:17 PM, Ralph Castain r...@open-mpi.org wrote:

 Perhaps it is a copy/paste error - but those two tables are identical
 
 On Aug 23, 2013, at 12:14 PM, Alan V. Cowles alan.cow...@duke.edu wrote:
 
 
 Final update for the day: we have found what is causing priority to be 
 overlooked; we just don't know why it is happening...
 
 [root@cluster-login ~]# squeue  --format=%a %.7i %.9P %.8j %.8u %.8T %.10M 
 %.9l %.6D %R |grep user1
 (null)  181378lowmem testbatc user1  PENDING   0:00 UNLIMITED1 
 (Priority)
 (null)  181379lowmem testbatc user1  PENDING   0:00 UNLIMITED1 
 (Priority)
 (null)  181380lowmem testbatc user1  PENDING   0:00 UNLIMITED1 
 (Priority)
 (null)  181381lowmem testbatc user1  PENDING   0:00 UNLIMITED1 
 (Priority)
 (null)  181382lowmem testbatc user1  PENDING   0:00 UNLIMITED1 
 (Priority)
 (null)  181383lowmem testbatc user1  PENDING   0:00 UNLIMITED1 
 (Priority)
 (null)  181384lowmem testbatc user1  PENDING   0:00 UNLIMITED1 
 (Priority)
 (null)  181385lowmem testbatc user1  PENDING   0:00 UNLIMITED1 
 (Priority)
 (null)  181386lowmem testbatc user1  PENDING   0:00 UNLIMITED1 
 (Priority)
 (null)  181387lowmem testbatc user1  PENDING   0:00 UNLIMITED1 
 (Priority)
 
 Compared to:
 
 [root@cluster-login ~]# squeue  --format=%a %.7i %.9P %.8j %.8u %.8T %.10M 
 %.9l %.6D %R |grep user2
 account  181378lowmem testbatc user2  PENDING   0:00 UNLIMITED1 
 (Priority)
 account  181379lowmem testbatc user2  PENDING   0:00 UNLIMITED1 
 (Priority)
 account  181380lowmem testbatc user2  PENDING   0:00 UNLIMITED1 
 (Priority)
 account  181381lowmem testbatc user2  PENDING   0:00 UNLIMITED1 
 (Priority)
 account  181382lowmem testbatc user2  PENDING   0:00 UNLIMITED1 
 (Priority)
 account  181383lowmem testbatc user2  PENDING   0:00 UNLIMITED1 
 (Priority)
 account  181384lowmem testbatc user2  PENDING   0:00 UNLIMITED1 
 (Priority)
 account  181385lowmem testbatc user2  PENDING   0:00 UNLIMITED1 
 (Priority)
 account  181386lowmem testbatc user2  PENDING   0:00 UNLIMITED1 
 (Priority)
 account  181387lowmem testbatc user2  PENDING   0:00 UNLIMITED1 
 (Priority)
 
 
 We have tried to create new users and new accounts this afternoon and all of 
 them show (null) as their account when we break out the formatting rules on 
 sacct.
 
 sacctmgr add account accountname
 sacctmgr add user username defaultaccount accountname
 
 We even have one case where all users under an account are working fine 
 except a user we added yesterday... so at some point in the past (logs 
 aren't helping us thus far) the ability to actually sync up a user and an 
 account for accounting purposes has left us. Also I have failed to mention 
 to this point that we are still running Slurm 2.5.4, my apologies for that.
 
 AC
 
 
 On 08/23/2013 11:22 AM, Alan V. Cowles wrote:
 Sorry to spam the list, but we wanted to keep updates in flux.
 
 We managed to find the issue in our mysqldb we are using for job accounting 
 which had the column value set to smallint (5) for that value, so it was 
 rounding things off, some SQL magic and we now have appropriate uid's 
 showing up. A new monkey wrench, some test jobs submitted by user3 below 
 get their fairshare value of 5000 as expected, just not user2... we just 
 cleared his jobs from the queue, and submitted another 100 jobs for testing 
 and none of them got a fairshare value...
 
 In his entire history of using our cluster he hasn't submitted over 5000 
 jobs, in fact:
 
 [root@slurm-master ~]# sacct -c 
 --format=user,jobid,jobname,start,elapsed,state,exitcode -u user2 | grep 
 user2 | wc -l
 2573
 
 So we can't figure out why he's being overlooked.
 
 AC
 
 
 On 08/23/2013 10:31 AM, Alan V. Cowles wrote:
 We think we may be onto something, in sacct we were looking at the jobs 
 submitted by the users, and found that many users share the same uidnumber 
 in the slurm database. It seems to correlate with the size of the user's 
 uid number in our ldap directory... users who's uid number are greater 
 than 65535 get trunked to that number... users with uid numbers below that 
 keep their correct uidnumbers (user2 in the sample output below)
 
 
 
 
 [root@slurm-master ~]# sacct -c 
 --format=User,uid,JobID,JobName,NodeList,Start,Elapsed,ExitCode,DerivedExitCode,state
  |grep user2|head
 user2  27545 30548   bwa node01-1 2013-07-08T13:04:25   
 00:00:48  0:0 COMPLETED
 user2  27545 30571   bwa node01-1 2013-07-08T15:18:00   
 00:00:48  0:0 COMPLETED
 user2  27545 30573   bwa node01-1 2013-07-09T09:40:59   
 00:00:48  0:0 COMPLETED
 user2  27545 30618  grep node01-1 2013-07-09T11:57

[slurm-dev] Re: [OMPI devel] slurm-dev Memory accounting issues with mpirun (was Re: Open-MPI build of NAMD launched from srun over 20% slower than with mpirun)

2013-08-11 Thread Ralph Castain

I can't speak to what you get from sacct, but I can say that things will 
definitely be different when launched directly via srun vs indirectly thru 
mpirun. The reason is that mpirun uses srun to launch the orte daemons, which 
then fork/exec all the application processes under them (as opposed to 
launching those app procs thru srun). This means two things:

1. Slurm has no direct knowledge or visibility into the application procs 
themselves when launched by mpirun. Slurm only sees the ORTE daemons. I'm sure 
that Slurm rolls up all the resources used by those daemons and their children, 
so the totals should include them

2. Since all Slurm can do is roll everything up, the resources shown in sacct 
will include those used by the daemons and mpirun as well as the application 
procs. Slurm doesn't include their daemons or the slurmctld in their 
accounting, so the two numbers will be significantly different. If you are 
attempting to limit overall resource usage, you may need to leave some slack 
for the daemons and mpirun.

You should also see an extra step in the mpirun-launched job as mpirun itself 
generally takes the first step, and the launch of the daemons occupies a second 
step.

As for the strange numbers you are seeing, it looks to me like you are hitting 
a mismatch of unsigned vs signed values. When adding them up, that could cause 
all kinds of erroneous behavior.


On Aug 6, 2013, at 11:55 PM, Christopher Samuel sam...@unimelb.edu.au wrote:

 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA1
 
 On 07/08/13 16:19, Christopher Samuel wrote:
 
 Anyone seen anything similar, or any ideas on what could be going
 on?
 
 Sorry, this was with:
 
 # ACCOUNTING
 JobAcctGatherType=jobacct_gather/linux
 JobAcctGatherFrequency=30
 
 Since those initial tests we've started enforcing memory limits (the
 system is not yet in full production) and found that this causes jobs
 to get killed.
 
 We tried the cgroups gathering method, but jobs still die with mpirun
 and now the numbers don't seem to right for mpirun or srun either:
 
 mpirun (killed):
 
 [samuel@barcoo-test Mem]$ sacct -j 94564 -o JobID,MaxRSS,MaxVMSize
   JobID MaxRSS  MaxVMSize
 -  -- --
 94564
 94564.batch-523362K  0
 94564.0 394525K  0
 
 srun:
 
 [samuel@barcoo-test Mem]$ sacct -j 94565 -o JobID,MaxRSS,MaxVMSize
   JobID MaxRSS  MaxVMSize
 -  -- --
 94565
 94565.batch998K  0
 94565.0  88663K  0
 
 
 All the best,
 Chris
 - -- 
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci
 
 -BEGIN PGP SIGNATURE-
 Version: GnuPG v1.4.11 (GNU/Linux)
 Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/
 
 iEYEARECAAYFAlIB73wACgkQO2KABBYQAh+kwACfYnMbONcpxD2lsM5i4QDw5r93
 KpMAn2hPUxMJ62u2gZIUGl5I0bQ6lllk
 =jYrC
 -END PGP SIGNATURE-
 ___
 devel mailing list
 de...@open-mpi.org
 http://www.open-mpi.org/mailman/listinfo.cgi/devel


[slurm-dev] Re: Understanding PMI2 support in SLURM 2.6.0

2013-07-25 Thread Ralph Castain

Just to be clear: there was indeed a bug in the Slurm pmi2 code that was 
emitting the reported error. That bug was fixed and the patch pushed into both 
trunk and 2.6 repositories on Tues.


On Jul 25, 2013, at 3:27 AM, Riebs, Andy andy.ri...@hp.com wrote:

 
 Thanks for the info!
 
 
 
 --
 Andy Riebs
 andy.ri...@hp.com
 
 
 
  Original message 
 From: Hongjia Cao hj...@nudt.edu.cn
 Date: 07/25/2013 12:25 AM (GMT-05:00)
 To: slurm-dev slurm-dev@schedmd.com
 Subject: [slurm-dev] Re: Understanding PMI2 support in SLURM 2.6.0
 
 
 
 On Mon, 2013-07-22 at 07:46 -0700, Andy Riebs wrote:
 Hi,
 
 We're trying to understand how PMI2 support works with SLURM, and have
 come up with a test program (see below) that demonstrates unexpected
 results.
 
 Our questions:
 1. What does it mean, when --mpi=pmi2 is _not_ specified, for
PMI2_Init() to return success, but leave size,rank,appnum
unchanged?
 You can run your program directly (execute ./a.out instead of srun -n
 2 ./a.out) and get similar results. When --mpi=pmi2 is not specified, the
 PMI2 client library assumes that the program is run as a singleton. srun -n 2
 ./a.out will run two copies of the program, each with a parallel size of 1. You
 will get 1 in numprocs after calling
 MPI_Comm_size(MPI_COMM_WORLD, &numprocs); in MPI programs.
 
 
 2. What does it mean, or what is going wrong, when --mpi=pmi2 is
    specified, to get the “slurmd[hadesn10]: mpi/pmi2: no value for
    key  in req” lines?
 
 
 This message will not appear when running MPI programs. The following is
 a code segment taken from contribs/pmi2/pmi2_api.c (function
 PMIi_WriteSimpleCommand()) of SLURM:
 
 1481 int pair_index;
 1482
 1483 PMI2U_printf("[BEGIN]");
 1484
 1485 /* leave space for length field */
 1486 memset(c, ' ', PMII_COMMANDLEN_SIZE);
 1487 c += PMII_COMMANDLEN_SIZE;
 1488 remaining_len -= PMII_COMMANDLEN_SIZE;
 1489
 1490 PMI2U_ERR_CHKANDJUMP(strlen(cmd) > PMI2_MAX_VALLEN, pmi2_errno,
 PMI2_ERR_OTHER, "**cmd_too_long");
 1491
 
 the above line 1488 is missing in the corresponding file
 src/pmi/pmi2/simple2pmi.c of MPICH:
 
 1429 ssize_t nbytes;
 1430 ssize_t offset;
 1431 int pair_index;
 1432
 1433 /* leave space for length field */
 1434 memset(c, ' ', PMII_COMMANDLEN_SIZE);
 1435 c += PMII_COMMANDLEN_SIZE;
 1436
 1437 PMI2U_ERR_CHKANDJUMP(strlen(cmd) > PMI2_MAX_VALLEN, pmi2_errno,
 PMI2_ERR_OTHER, "**cmd_too_long");
 1438
 
 It is not clear which is correct according to the design documents of
 PMI2 (http://wiki.mpich.org/mpich/index.php/PMI_v2_Wire_Protocol). But
 the mpi/pmi2 plugin in SLURM, which implements the server part of the PMI
 protocol, conforms to the MPICH implementation.
 
 Andy
 
 
 The program:
 
 --
 
 
 /*
 
    Using SLURM 2.6.0
 
    To build and run it:
 
       cc -Wall pmi2-001.c -I/opt/slurm/include -L/opt/slurm/lib64 -lpmi2 && echo "" && \
         srun -n 2 ./a.out && echo "" && srun --mpi=pmi2 -n 2 ./a.out
 
    Sample output:
 
    Init = 0; spawned = 0, size = -99, rank = -99, appnum = -99
    Job_GetId = 14; id =
    Init = 0; spawned = 0, size = -99, rank = -99, appnum = -99
    Job_GetId = 14; id =
 
    slurmd[hadesn10]: mpi/pmi2: no value for key  in req
    Init = 0; spawned = 0, size = 2, rank = 1, appnum = -1
    Job_GetId = 0; id = 252.0
    slurmd[hadesn10]: mpi/pmi2: no value for key  in req
    Init = 0; spawned = 0, size = 2, rank = 0, appnum = -1
    Job_GetId = 0; id = 252.0
 
 */
 
 #include <stdio.h>
 #include <slurm/pmi2.h>
 
 int
 main()
 {
     int ret;
     {
         int spawned = -99, size = -99, rank = -99, appnum = -99;
         ret = PMI2_Init(&spawned, &size, &rank, &appnum);
         printf("Init = %d; spawned = %d, size = %d, rank = %d, appnum = %d\n",
                ret, spawned, size, rank, appnum);
     }
     {
         char id[PMI2_MAX_KEYLEN];
         ret = PMI2_Job_GetId(id, sizeof(id));
         printf("Job_GetId = %d; id = %s\n", ret, id);
     }
     fflush(NULL);
     return 0;
 }
 
 --
 --
 Andy Riebs
 Hewlett-Packard Company
 High Performance Computing
 +1 404 648 9024
 My opinions are not necessarily those of HP
 


[slurm-dev] Re: Open-MPI build of NAMD launched from srun over 20% slower than with mpirun

2013-07-23 Thread Ralph Castain

On Jul 23, 2013, at 9:59 AM, Tim Wickberg wickb...@gwu.edu wrote:

 I'm assuming the jobs are running across multiple nodes, using MPI for 
 communication?
 
 I'm guessing that srun is resulting in communication going across a GigE 
 fabric rather than IB, where mpirun directly is using the IB. A ~20% 
 performance penalty would make sense in that context.

This would happen only if someone specified that mpirun use the ip-over-ib 
interface, which users typically don't do in order to maintain separation 
between the out-of-band and MPI traffic.
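
For anyone who does want to steer the traffic explicitly, here is a sketch of the relevant Open MPI controls; the interface names (eth0, ib0), process count, and binary are assumptions, not values taken from this thread:

    # Pin Open MPI's out-of-band control traffic and its TCP BTL (if used)
    # to specific interfaces:
    mpirun -np 64 \
        --mca oob_tcp_if_include eth0 \
        --mca btl_tcp_if_include ib0 \
        ./namd2 input.conf

    # Or force the verbs/IB transport for the MPI traffic outright:
    mpirun -np 64 --mca btl openib,sm,self ./namd2 input.conf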


 
 - Tim
 
 --
 Tim Wickberg
 wickb...@gwu.edu
 Senior HPC Systems Administrator
 The George Washington University
 
 
 On Tue, Jul 23, 2013 at 3:06 AM, Christopher Samuel sam...@unimelb.edu.au 
 wrote:
 
 
 Hi there slurm-dev and OMPI devel lists,
 
 Bringing up a new IBM SandyBridge cluster I'm running a NAMD test case
 and noticed that if I run it with srun rather than mpirun it goes over
 20% slower.  These are all launched from an sbatch script too.
 
 Slurm 2.6.0, RHEL 6.4 (latest kernel), FDR IB.
 
 Here are some timings as reported as the WallClock time by NAMD itself
 (so not including startup/tear down overhead from Slurm).
 
 srun:
 
 run1/slurm-93744.out:WallClock: 695.079773  CPUTime: 695.079773
 run4/slurm-94011.out:WallClock: 723.907959  CPUTime: 723.907959
 run5/slurm-94013.out:WallClock: 726.156799  CPUTime: 726.156799
 run6/slurm-94017.out:WallClock: 724.828918  CPUTime: 724.828918
 
 Average of 692 seconds
 
 mpirun:
 
 run2/slurm-93746.out:WallClock: 559.311035  CPUTime: 559.311035
 run3/slurm-93910.out:WallClock: 544.116333  CPUTime: 544.116333
 run7/slurm-94019.out:WallClock: 586.072693  CPUTime: 586.072693
 
 Average of 563 seconds.
 
 So that's about 23% slower.
 
 Everything is identical (they're all symlinks to the same golden
 master) *except* for the srun / mpirun which is modified by copying
 the batch script and substituting mpirun for srun.
 
 When they are running I can see that for jobs launched with srun they
 are direct children of slurmstepd whereas when started with mpirun
 they are children of Open-MPI's orted (or mpirun on the launch node)
 which itself is a child of slurmstepd.
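
One way to confirm that parentage while a job is running is to look at the process tree on a compute node; a small sketch (the node name is an example):

    # Under srun the ranks should sit directly under slurmstepd; under mpirun
    # they should appear beneath an orted that is itself under slurmstepd.
    ssh node01 'ps -eo pid,ppid,comm --forest' | grep -A 20 slurmstepd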
 
 Has anyone else seen anything like this, or got any ideas?
 
 cheers,
 Chris
 - --
  Christopher Samuel    Senior Systems Administrator
  VLSCI - Victorian Life Sciences Computation Initiative
  Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
  http://www.vlsci.org.au/  http://twitter.com/vlsci
 
 
 



[slurm-dev] Re: pmi2?

2013-07-18 Thread Ralph Castain

It was built using Open MPI, which was configured with

set SLURMLOC = /opt/slurm;  ./configure --with-slurm=$SLURMLOC 
--with-pmi=$SLURMLOC --with-verbs --without-mpi-param-check 
--enable-orterun-prefix-by-default --prefix=$PWD/install 
--with-mxm=/opt/mellanox/mxm --with-fca=/opt/mellanox/fca CFLAGS="-g -O3"

and from what we can tell have both PMI and PMI2 built in.
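
A quick way to double-check what that build actually picked up is to ask ompi_info; a sketch (the exact output format varies between Open MPI releases):

    # List the compiled-in components/configuration and look for Slurm/PMI support
    ompi_info | grep -i slurm
    ompi_info | grep -i pmi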

The SLURM version is the recently released 2.6.0 (presumably built straightforwardly, 
but I don’t have the details).

On Jul 18, 2013, at 6:31 AM, Hongjia Cao hj...@nudt.edu.cn wrote:

 
 Could you tell how is the program foo built? Which MPI version and PMI
 library are you using?
 
 On Wed, 2013-07-17 at 13:16 -0700, Ralph Castain wrote:
 Hi folks
 
 
 We're trying to test the pmi2 support in 2.6.0 and hitting a problem.
 We have verified that the pmi2 support was built/installed, and that
 both slurmctld and slurmd are at 2.6.0 level. When we run srun
 --mpi=list, we get:
 
 
 srun: MPI types are... 
 srun: mpi/mvapich
 srun: mpi/pmi2
 srun: mpi/mpich1_shmem
 srun: mpi/mpich1_p4
 srun: mpi/none
 srun: mpi/lam
 srun: mpi/openmpi
 srun: mpi/mpichmx
 srun: mpi/mpichgm
 
 
 So it looks like the install is correct. However, when we attempt to
 run a job with srun --mpi=pmi2 foo, we get an error from the slurmd
 on the remote node:
 
 
 slurmd[n1]: mpi/pmi2: no value for key  in req
 
 
 and the PMI calls in the app fail. Any ideas as to the source of the
 problem? Do we have to configure something else, or start slurmd with
 some option?
 
 
 Thanks
 Ralph
 
 
 
 


[slurm-dev] pmi2?

2013-07-17 Thread Ralph Castain
Hi folks

We're trying to test the pmi2 support in 2.6.0 and hitting a problem. We have 
verified that the pmi2 support was built/installed, and that both slurmctld and 
slurmd are at 2.6.0 level. When we run srun --mpi=list, we get:

srun: MPI types are... 
srun: mpi/mvapich
srun: mpi/pmi2
srun: mpi/mpich1_shmem
srun: mpi/mpich1_p4
srun: mpi/none
srun: mpi/lam
srun: mpi/openmpi
srun: mpi/mpichmx
srun: mpi/mpichgm

So it looks like the install is correct. However, when we attempt to run a job 
with srun --mpi=pmi2 foo, we get an error from the slurmd on the remote node:

slurmd[n1]: mpi/pmi2: no value for key  in req

and the PMI calls in the app fail. Any ideas as to the source of the problem? 
Do we have to configure something else, or start slurmd with some option?

Thanks
Ralph



[slurm-dev] Re: Job Groups

2013-06-19 Thread Ralph Castain

Could you just create a dedicated queue (partition) for those jobs, and then configure 
its priority and a limit on how many of its jobs can run simultaneously? Then all they 
would have to do is ensure they submit those jobs to that queue.
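
Paul mentions the accounting interface in the message quoted below; for reference, a minimal sketch of that route, assuming slurmdbd accounting is already in place (the QOS name, account name, and limit are made up):

    # Create a QOS that caps the number of simultaneously running jobs
    sacctmgr add qos lab_io
    sacctmgr modify qos lab_io set GrpJobs=50
    # Let the group's account use it
    sacctmgr modify account lab_group set qos+=lab_io
    # The users then tag the I/O-heavy jobs themselves:
    sbatch --qos=lab_io heavy_io_job.sh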

On Jun 19, 2013, at 8:36 AM, Paul Edmon ped...@cfa.harvard.edu wrote:

 
 I have a group here that wants to submit a ton of jobs to the queue, but 
 want to restrict how many they have running at any given time so that 
 they don't torch their fileserver.  They were using bgmod -L in LSF to 
 do this, but they were wondering if there was a similar way in SLURM to 
 do so.  I know you can do this via the accounting interface but it would 
 be good if I didn't have to apply it as a blanket to all their jobs and 
 if they could manage it themselves.
 
 If nothing exists in SLURM to do this that's fine.  One can always 
 engineer around it.  I figured I would ping the dev list first before 
 putting a nail in it.  From my look at the documentation I don't see 
 any way to do this other than what I stated above.
 
 -Paul Edmon-


[slurm-dev] Re: Slurmctld multithreaded?

2013-06-12 Thread Ralph Castain
Not isolating, but blocking. If you have more ports, I believe it will add more 
threads to listen on those ports. Each RPC received blocks until it completes, 
so having more ports should improve throughput.
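
A sketch of the port-range setting being discussed (the range itself is arbitrary; slurm.conf must match on the controller and the nodes, and slurmctld needs a restart to pick it up):

    # In slurm.conf:
    #   SlurmctldPort=6817-6820
    # Then verify what the running daemons believe:
    scontrol show config | grep -i SlurmctldPort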


On Jun 12, 2013, at 10:03 AM, Alan V. Cowles alan.cow...@duke.edu wrote:

 No, we have it set exclusively to 6817, and SlurmdPort two lines later to 6818.
 
 Is it isolating to processors based on incoming port?
 
 AC
 
 On 06/12/2013 01:00 PM, Lyn Gerner wrote:
 Alan, are you using the port range option on SlurmctldPort (e.g., 
 SlurmctldPort=6817-6818) in slurm.conf?
 
 
 On Wed, Jun 12, 2013 at 9:55 AM, Alan V. Cowles alan.cow...@duke.edu wrote:
 
 Under the Data Objects section on the following page
 http://slurm.schedmd.com/selectplugins.html we find the statement:
 
 "Slurmctld is a multi-threaded program with independent read and write
 locks on each data structure type."
 
 Which is what lead me to believe it's there, that we perhaps missed a
 configuration option.
 
 AC
 
 
 
 On 06/12/2013 12:43 PM, Paul Edmon wrote:
  I'm also interested in this as I've only ever seen one slurmctld and
  only at 100%.  It would be good if making slurm multithreaded was on the
  path for the future.  I know we will have 100,000's of jobs in flight
  for our config so it would be good to have something that can take that
  load.
 
  -Paul Edmon-
 
  On 06/12/2013 12:30 PM, Alan V. Cowles wrote:
  Hey Guys,
 
  I've seen a few references to the slurmctld as a multithreaded process
  but it doesn't seem that way.
 
  We had a user submit 18000 jobs to our cluster (512 slots) and it shows
  512 fully loaded, shows those jobs running, shows about 9800 currently
  pending, but upon her submission threw errors around 16500.
 
  Submitted batch job 16589
  Submitted batch job 16590
  Submitted batch job 16591
  sbatch: error: Slurm temporarily unable to accept job, sleeping and
  retrying.
  sbatch: error: Batch job submission failed: Resource temporarily
  unavailable.
 
  The thing we noticed at this time on our master host is that slurmctld
  was pegging at 100% on one cpu quite regularly and paged 16GB of virtual
  memory, while all other cpu's were completely idle.
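
One way to see where a single slurmctld thread is spending its time is the RPC statistics report; a sketch, assuming a Slurm release that ships sdiag:

    sdiag            # per-RPC-type counts and average processing times
    sdiag --reset    # clear the counters before repeating a submission test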
 
  We wondered if the pegging out of the control daemon is what led to the
  submission failure, as we haven't found any limits set anywhere to any
  specific job or user, and wondered if perhaps we missed a configure
  option for this when we did our original install.
 
  Any thoughts or ideas? We're running Slurm 2.5.4 on RHEL6.
 
  AC
 
 
 



[slurm-dev] Re: PMI_Exit? was Re: PMI_Abort with zero value

2013-06-04 Thread Ralph Castain
The PMI group is handled elsewhere, but I agree that we need a different
API for this purpose. We see conditional execution in Hadoop and elsewhere,
so it is a reasonable use-case - just need to leave abort as something
different. I'd propose we create something like PMI_Terminate as a separate
way of implementing it.

I'll pose the question to the PMI folks and post the conclusion here as
well.
Ralph



On Tue, Jun 4, 2013 at 6:52 AM, Andy Riebs andy.ri...@hp.com wrote:

  I don't know the provenance of the PMI specification, but would it be
 possible to add a new function (at least within SLURM's PMI implementation)
 with the effect that Victor describes? Legacy SHMEM implementations have
 provided globalexit() and, should OpenSHMEM evolve to include it, it will
 likely have the semantics that globalexit(0) should cause the launcher to
 exit with 0.

 Andy


 On 06/04/2013 07:59 AM, Victor Kocheganov wrote:

  OK, I see your points: I did not suspect such behavior would be so inconvenient,
 but the reasons are all convincing. The source of the requirement is simply what
 our users want.
 I will try to find another approach then.

  Thanks for the detailed explanation, Ralph!


 On Tue, Jun 4, 2013 at 3:34 AM, Ralph Castain r...@open-mpi.org wrote:

  The OMPI developers were meeting this afternoon, so we took advantage
 of it to discuss this topic. We would recommend not changing the current
 behavior for two reasons. First, there is a long precedent for returning
 the first non-zero status, and returning a non-zero status if any process
 causes the entire job to abort even if they all abort with status zero.
 This is the only way the user (and any script they are using) can know that
 an abort was ordered.

  Second, we have looked at the OpenShmem standard and confirmed that
 nothing is said there about returning zero status in such situations. We
 don't know the source of this proposed requirement, but feel that it
 shouldn't override the community's expected behavior.

  Just our $0.02
 Ralph



 On Mon, Jun 3, 2013 at 1:45 PM, Ralph Castain r...@open-mpi.org wrote:

  I'm leery of this patch - will discuss with other MPI folks as this
 could cause problems for existing apps




 Sent from my iPhone

 On Jun 3, 2013, at 5:17 AM, Riebs, Andy andy.ri...@hp.com wrote:

 Hi Victor,



 If the patch is straight-forward, and the reason for it is clear,
 patches sent to this list tend to be adopted quickly. However, since this
 changes behavior that someone else may be counting on, it might get held
 for the next major release if it is accepted.



 Andy



  *From:* Victor Kocheganov [mailto:victor.kochega...@itseez.com]

 *Sent:* Monday, June 03, 2013 6:48 AM
 *To:* slurm-dev
 *Subject:* [slurm-dev] Re: PMI_Abort with zero value



 OK, I see.



  I've got the SLURM sources with PMI in them and found out the reason for the
  strange behavior (I mean that the rank 0 process behaves differently from the
  others in PMI_Abort()).

  It seems clear how to deal with it. Is it a complex procedure to provide a
  minor fix to the community (via a patch)?



 On Mon, Jun 3, 2013 at 12:18 AM, Brian Gilmer bfgil...@gmail.com
 wrote:

  We are able to use _exit() so I did not go any further.  The behavior
  of PMI_Abort() and exit() were both odd so I thought that might save you some
  time.  I am interested if you find another solution.



 On Sun, Jun 2, 2013 at 3:39 AM, Victor Kocheganov 
 victor.kochega...@itseez.com wrote:

 Thank you for the rapid answer! But still I have several questions,
 please see inline.



 On Fri, May 31, 2013 at 6:47 PM, Brian Gilmer bfgil...@gmail.com
 wrote:

 I addressed a similar problem with _exit(value).

  [Victor Kocheganov] Where can I find it? I cannot find any clue in the archive
  of the slurm-dev list
  (http://dir.gmane.org/gmane.comp.distributed.slurm.devel)



  Slurm will kill off the rest of the pe in a job step if one exits
 with a non-zero code.

  [Victor Kocheganov] Unfortunately it depends on slurm configurations
 as far as I know (whether '-K' flag is set or not; it could be set
 implicitly). So I can not rely on such a behavior...



  The exit() function doesn't work under mx shmem because the exit()
 function is overridden and does not propagate the exit code.
 PMI_Abort(exit_code) uses exit() so in our case it always returns an exit
 code of 9 regardless of the value of exit_code.

  [Victor Kocheganov] And this is interesting, because I see that SLURM
 always returns zero value to system when PMI_Abort(0,NULL) was invoked by
 some process, except for the case when process with zero rank (PMI daemon
 as I suspect) invoked it. Therefore a little hope still exists in my mind,
 that I can make PMI_Abort work for me (return zero always in
 case PMI_Abort(0,NULL)).



  But you are saying that there is no hope in PMI_Abort(), do I
  understand right? Do you have any other ways to make SLURM (using PMI or
  without it) terminate all the processes if one of them requested it (with
  the passed exit statuses, of course)?

[slurm-dev] Re: PMI_Abort with zero value

2013-06-03 Thread Ralph Castain
I'm leery of this patch - will discuss with other MPI folks as this could
cause problems for existing apps




 Sent from my iPhone

 On Jun 3, 2013, at 5:17 AM, Riebs, Andy andy.ri...@hp.com wrote:

   Hi Victor,


 If the patch is straight-forward, and the reason for it is clear, patches
 sent to this list tend to be adopted quickly. However, since this changes
 behavior that someone else may be counting on, it might get held for the
 next major release if it is accepted.


 Andy


  *From:* Victor Kocheganov [mailto:victor.kochega...@itseez.com]

 *Sent:* Monday, June 03, 2013 6:48 AM
 *To:* slurm-dev
 *Subject:* [slurm-dev] Re: PMI_Abort with zero value


 OK, I see.


  I've got the SLURM sources with PMI in them and found out the reason for the
  strange behavior (I mean that the rank 0 process behaves differently from the
  others in PMI_Abort()).

  It seems clear how to deal with it. Is it a complex procedure to provide a
  minor fix to the community (via a patch)?


  On Mon, Jun 3, 2013 at 12:18 AM, Brian Gilmer bfgil...@gmail.com wrote:

  We are able to use _exit() so I did not go any further.  The behavior of
  PMI_Abort() and exit() were both odd so I thought that might save you some
  time.  I am interested if you find another solution.


 On Sun, Jun 2, 2013 at 3:39 AM, Victor Kocheganov 
 victor.kochega...@itseez.com wrote:

 Thank you for the rapid answer! But still I have several questions, please
 see inline.


  On Fri, May 31, 2013 at 6:47 PM, Brian Gilmer bfgil...@gmail.com wrote:

 I addressed a similar problem with _exit(value).  

  [Victor Kocheganov] Where can I find it? I cannot find any clue in the archive
  of the slurm-dev list
  (http://dir.gmane.org/gmane.comp.distributed.slurm.devel)


  Slurm will kill off the rest of the pe in a job step if one exits with a
 non-zero code. 

  [Victor Kocheganov] Unfortunately it depends on slurm configurations as
 far as I know (whether '-K' flag is set or not; it could be set
 implicitly). So I can not rely on such a behavior...


  The exit() function doesn't work under mx shmem because the exit()
 function is overridden and does not propagate the exit code.
 PMI_Abort(exit_code) uses exit() so in our case it always returns an exit
 code of 9 regardless of the value of exit_code.

  [Victor Kocheganov] And this is interesting, because I see that SLURM
 always returns zero value to system when PMI_Abort(0,NULL) was invoked by
 some process, except for the case when process with zero rank (PMI daemon
 as I suspect) invoked it. Therefore a little hope still exists in my mind,
 that I can make PMI_Abort work for me (return zero always in
 case PMI_Abort(0,NULL)).


  But you are saying that there is no hope in PMI_Abort(), do I understand
  right? Do you have any other ways to make SLURM (using PMI or without it)
  terminate all the processes if one of them requested it (with the passed exit
  statuses, of course)?


 On Fri, May 31, 2013 at 5:23 AM, Victor Kocheganov 
 victor.kochega...@itseez.com wrote:

   Hello,


  I am a SHMEM library developer and I am looking for an approach to terminate
  the whole slurm job with a specific exit status when one of the processes
  initiates it. That is, the SHMEM library should have some API routine named
  'globalexit(int status);', which terminates the job, and the other processes
  in it, with that exit status.


 The only way I found out is to use PMI_Abort(status), but it does not work
 for zero status value, when PMI_Abort is invoked by zero process (daemon
 for PMI, as I understand). Is it normal behavior or a bug? Could you please
 help to find any other approaches, if this one does not seem proper for
 slurm?


 Thank you in advance,

 Victor Kocheganov.

 




 --
 Speak when you are angry--and you will make the best speech you'll ever
 regret.
   - Laurence J. Peter 

 





[slurm-dev] Re: PMI_Abort with zero value

2013-06-03 Thread Ralph Castain
The OMPI developers were meeting this afternoon, so we took advantage of it
to discuss this topic. We would recommend not changing the current behavior
for two reasons. First, there is a long precedent for returning the first
non-zero status, and returning a non-zero status if any process causes the
entire job to abort even if they all abort with status zero. This is the
only way the user (and any script they are using) can know that an abort
was ordered.

Second, we have looked at the OpenShmem standard and confirmed that nothing
is said there about returning zero status in such situations. We don't know
the source of this proposed requirement, but feel that it shouldn't
override the community's expected behavior.

Just our $0.02
Ralph



On Mon, Jun 3, 2013 at 1:45 PM, Ralph Castain r...@open-mpi.org wrote:

  I'm leery of this patch - will discuss with other MPI folks as this
 could cause problems for existing apps




 Sent from my iPhone

 On Jun 3, 2013, at 5:17 AM, Riebs, Andy andy.ri...@hp.com wrote:

   Hi Victor,


 If the patch is straight-forward, and the reason for it is clear, patches
 sent to this list tend to be adopted quickly. However, since this changes
 behavior that someone else may be counting on, it might get held for the
 next major release if it is accepted.


 Andy


  *From:* Victor Kocheganov [mailto:victor.kochega...@itseez.com]

 *Sent:* Monday, June 03, 2013 6:48 AM
 *To:* slurm-dev
 *Subject:* [slurm-dev] Re: PMI_Abort with zero value


 OK, I see.


  I've got the SLURM sources with PMI in them and found out the reason for the
  strange behavior (I mean that the rank 0 process behaves differently from the
  others in PMI_Abort()).

  It seems clear how to deal with it. Is it a complex procedure to provide a
  minor fix to the community (via a patch)?


 On Mon, Jun 3, 2013 at 12:18 AM, Brian Gilmer bfgil...@gmail.com wrote:
 

  We are able to use _exit() so I did not go any further.  The behavior of
  PMI_Abort() and exit() were both odd so I thought that might save you some
  time.  I am interested if you find another solution.


 On Sun, Jun 2, 2013 at 3:39 AM, Victor Kocheganov 
 victor.kochega...@itseez.com wrote:

 Thank you for the rapid answer! But still I have several questions,
 please see inline.


 On Fri, May 31, 2013 at 6:47 PM, Brian Gilmer bfgil...@gmail.com wrote:
 

 I addressed a similar problem with _exit(value).  

  [Victor Kocheganov] Where can I find it? I cannot find any clue in the archive
  of the slurm-dev list
  (http://dir.gmane.org/gmane.comp.distributed.slurm.devel)


  Slurm will kill off the rest of the pe in a job step if one exits with
 a non-zero code. 

  [Victor Kocheganov] Unfortunately it depends on slurm configurations as
 far as I know (whether '-K' flag is set or not; it could be set
 implicitly). So I can not rely on such a behavior...


  The exit() function doesn't work under mx shmem because the exit()
 function is overridden and does not propagate the exit code.
 PMI_Abort(exit_code) uses exit() so in our case it always returns an exit
 code of 9 regardless of the value of exit_code.

  [Victor Kocheganov] And this is interesting, because I see that SLURM
 always returns zero value to system when PMI_Abort(0,NULL) was invoked by
 some process, except for the case when process with zero rank (PMI daemon
 as I suspect) invoked it. Therefore a little hope still exists in my mind,
 that I can make PMI_Abort work for me (return zero always in
 case PMI_Abort(0,NULL)).


  But you are saying that there is no hope in PMI_Abort(), do I understand
  right? Do you have any other ways to make SLURM (using PMI or without it)
  terminate all the processes if one of them requested it (with the passed exit
  statuses, of course)?


 On Fri, May 31, 2013 at 5:23 AM, Victor Kocheganov 
 victor.kochega...@itseez.com wrote:

   Hello,


  I am a SHMEM library developer and I am looking for an approach to terminate
  the whole slurm job with a specific exit status when one of the processes
  initiates it. That is, the SHMEM library should have some API routine named
  'globalexit(int status);', which terminates the job, and the other processes
  in it, with that exit status.


 The only way I found out is to use PMI_Abort(status), but it does not
 work for zero status value, when PMI_Abort is invoked by zero process
 (daemon for PMI, as I understand). Is it normal behavior or a bug? Could
 you please help to find any other approaches, if this one does not seem
 proper for slurm?


 Thank you in advance,

 Victor Kocheganov.

 




 --
 Speak when you are angry--and you will make the best speech you'll ever
 regret.
   - Laurence J. Peter 

 


[slurm-dev] Re: Scheduling in a heterogeneous cluster?

2013-01-05 Thread Ralph Castain


On Jan 5, 2013, at 3:33 AM, Ole Holm Nielsen ole.h.niel...@fysik.dtu.dk wrote:

 
 On 04-01-2013 18:28, Ralph Castain wrote:
 FWIW: I believe we have a mapper in Open MPI that does what you want - i.e., 
 it looks at the number of available cpus on each node, and maps the 
 processes to maximize the number of procs co-located on nodes. In your 
 described case, it would tend to favor the 16ppn nodes as that would provide 
 the best MPI performance, then move to the 8ppn nodes, etc.
 
 Yes, OpenMPI can be tweaked into different task layouts on the available set 
 of 
 nodes.  But the problem at hand is for the scheduler to allocate some set of 
 nodes, given that the user want a specific number of MPI tasks (say, 32), and 
 given that there are heterogeneous nodes available with 4, 8 or 16 cores.  As 
 Moe wrote, this flexibility would be hard to achieve with Slurm (with Maui 
 it's 
 impossible).

I think you may have misunderstood both Moe and me. I believe Moe was pointing 
out that you would need a new plugin to provide that capability, and I pointed 
out that we already have that algorithm in OMPI and could port it to a Slurm 
plugin if desired. That said, it would take some time to make that happen, so 
it may not resolve your problem.
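
For completeness, a sketch of the mpirun placement controls that expose this kind of mapping choice; the option spellings below are the Open MPI 1.6-era ones and the executable name is just an example:

    mpirun -np 32 -bynode ./app    # round-robin ranks across the allocated nodes
    mpirun -np 32 -bycore ./app    # pack ranks onto each node's cores first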

 
 Regards,
 Ole


[slurm-dev] Re: Help installing slurm 2.5.0

2012-12-26 Thread Ralph Castain

Please disregard - apparently, I still had some files around from when I was 
playing with a POE simulator and Slurm found them. Removing those files and 
rebuilding Slurm solved the problem.


On Dec 25, 2012, at 6:00 PM, Ralph Castain r...@open-mpi.org wrote:

 
 Hi folks
 
 I have a Centos 6.1 cluster and am trying to install Slurm 2.5.0 from the 
 tarball. I successfully built Slurm, but am having trouble getting things to 
 run. I am able to get an allocation using salloc, but executing srun results 
 in a segfault with an error message indicating that it was unable to find 
 launch plugin for launch/poe.
 
 I have no idea why it is looking for a poe launch plugin. Any suggestions 
 would be welcome.
 
 Ralph


[slurm-dev] Re: Help installing slurm 2.5.0

2012-12-26 Thread Ralph Castain

Thanks Andy - per my other note, I resolved the problem. Appreciate your 
suggestions and time!

On Dec 26, 2012, at 7:36 AM, Riebs, Andy andy.ri...@hp.com wrote:

 Ralph,
 
 A couple of things may help to prime the pump here...
 
 1. Would you please send a copy of your slurm.conf file (sanitized, if 
 necessary) to the list
 2. Is it srun, or one of the slurm daemons, that segfaults? Assuming it's 
 srun, try executing it with srun -vv to generate other debugging 
 information, like where it is looking for the slurm.conf file and the slurm 
 daemons. (It's not unheard of for SLURM to be using a configuration file 
 other than what the user has been editing...)
 
 Andy
 
 -Original Message-
 From: Ralph Castain [mailto:r...@open-mpi.org] 
 Sent: Tuesday, December 25, 2012 8:37 PM
 To: slurm-dev
 Subject: [slurm-dev] Help installing slurm 2.5.0
 
 
 Hi folks
 
 I have a Centos 6.1 cluster and am trying to install Slurm 2.5.0 from the 
 tarball. I successfully built Slurm, but am having trouble getting things to 
 run. I am able to get an allocation using salloc, but executing srun results 
 in a segfault with an error message indicating that it was unable to find 
 launch plugin for launch/poe.
 
 I have no idea why it is looking for a poe launch plugin. Any suggestions 
 would be welcome.
 
 Ralph