[slurm-dev] Re: Wrong behaviour of "--tasks-per-node" flag

2016-12-02 Thread Miguel Couceiro


Hi.

I think I am also experiencing the same problem with SLURM 16.05.2.



slurm.conf:

SchedulerPort=7321
SchedulerRootFilter=1
SchedulerType=sched/backfill

SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory,CR_Pack_Nodes

NodeName=CompNode[001-020] NodeAddr=172.16.32.[1-20] Boards=1 SocketsPerBoard=2 CoresPerSocket=12 ThreadsPerCore=1 RealMemory=193233 TmpDisk=102350 State=UNKNOWN
PartitionName=unlimited Nodes=CompNode[001-020] Default=NO MaxNodes=20 DefaultTime=UNLIMITED MaxTime=UNLIMITED Priority=100 Hidden=YES State=UP AllowGroups=slurmspecial




mm.batch:

#!/bin/bash

#SBATCH --time=24:00:00
#SBATCH --nodes=5
#SBATCH --ntasks=10
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=12
#SBATCH --licenses=comsol@headnode
#SBATCH --partition=unlimited
#SBATCH --exclusive
#SBATCH --output=~/Documents/COMSOL/mm.out
#SBATCH --error=~/Documents/COMSOL/mm.err

module load Programs/comsol-5.2a

comsol -nn $SLURM_NTASKS -nnhost $SLURM_NTASKS_PER_NODE \
    -np $SLURM_CPUS_PER_TASK -numasets 2 -mpmode owner \
    -mpibootstrap slurm -mpifabrics ofa:ofa \
    batch -job b1 -inputfile ~/Documents/COMSOL/mm.mph \
    -outputfile ~/Documents/COMSOL/mm-results.mph \
    -batchlog ~/Documents/COMSOL/mm.log




mm.err:

srun: Warning: can't honor --ntasks-per-node set to 2 which doesn't match the requested tasks 5 with the number of requested nodes 5. Ignoring --ntasks-per-node.




mm.out:

[0] MPI startup(): Multi-threaded optimized library
[9] MPID_nem_ofacm_init(): Init
[1] MPID_nem_ofacm_init(): Init
[6] MPID_nem_ofacm_init(): Init
[4] MPID_nem_ofacm_init(): Init
[0] MPID_nem_ofacm_init(): Init
[8] MPID_nem_ofacm_init(): Init
[5] MPID_nem_ofacm_init(): Init
[7] MPID_nem_ofacm_init(): Init
[3] MPID_nem_ofacm_init(): Init
[2] MPID_nem_ofacm_init(): Init
[9] MPI startup(): ofa data transfer mode
[1] MPI startup(): ofa data transfer mode
[6] MPI startup(): ofa data transfer mode
[4] MPI startup(): ofa data transfer mode
[0] MPI startup(): ofa data transfer mode
[8] MPI startup(): ofa data transfer mode
[7] MPI startup(): ofa data transfer mode
[5] MPI startup(): ofa data transfer mode
[3] MPI startup(): ofa data transfer mode
[2] MPI startup(): ofa data transfer mode
[0] MPI startup(): Rank  Pid    Node name    Pin cpu
[0] MPI startup(): 0     15011  CompNode001  {1,3,5,7,9,11,13,15,17,19,21,23}
[0] MPI startup(): 1     15012  CompNode001  {0,2,4,6,8,10,12,14,16,18,20,22}
[0] MPI startup(): 2     45695  CompNode002  {1,3,5,7,9,11,13,15,17,19,21,23}
[0] MPI startup(): 4     17504  CompNode003  {1,3,5,7,9,11,13,15,17,19,21,23}
[0] MPI startup(): 5     17505  CompNode003  {0,2,4,6,8,10,12,14,16,18,20,22}
[0] MPI startup(): 6     8158   CompNode004  {1,3,5,7,9,11,13,15,17,19,21,23}
[0] MPI startup(): 7     8159   CompNode004  {0,2,4,6,8,10,12,14,16,18,20,22}
[0] MPI startup(): 8     45973  CompNode005  {1,3,5,7,9,11,13,15,17,19,21,23}
[0] MPI startup(): 9     45974  CompNode005  {0,2,4,6,8,10,12,14,16,18,20,22}

Node 0 is running on host: CompNode001
Node 0 has address: CompNode001.laced.ib
Node 1 is running on host: CompNode001
Node 1 has address: CompNode001.laced.ib
Node 2 is running on host: CompNode002
Node 2 has address: CompNode002.laced.ib
Node 3 is running on host: CompNode002
Node 3 has address: CompNode002.laced.ib
Node 4 is running on host: CompNode003
Node 4 has address: CompNode003.laced.ib
Node 5 is running on host: CompNode003
Node 5 has address: CompNode003.laced.ib
Node 6 is running on host: CompNode004
Node 6 has address: CompNode004.laced.ib
Node 7 is running on host: CompNode004
Node 7 has address: CompNode004.laced.ib
Node 8 is running on host: CompNode005
Node 8 has address: CompNode005.laced.ib
Node 9 is running on host: CompNode005
Node 9 has address: CompNode005.laced.ib
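
Note that the request is internally consistent: --nodes=5 times --ntasks-per-node=2 gives exactly the 10 tasks asked for with --ntasks=10, and the rank table above confirms that 2 tasks did land on each of the 5 nodes. The "5 tasks" in the warning therefore does not correspond to anything in the batch script.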


Regards,
Miguel



On 10/21/2016 02:54 PM, Manuel Rodríguez Pascual wrote:

Wrong behaviour of "--tasks-per-node" flag
Hi all,

I am having the weirdest error ever. I am pretty sure this is a bug. 
I have reproduced the error on the latest Slurm commit (slurm 
17.02.0-0pre2, commit 406d3fe429ef6b694f30e19f69acf989e65d7509) and 
on the slurm 16.05.5 branch. It does NOT happen in slurm 15.08.12.


My cluster is composed of 8 nodes, each with 2 sockets of 8 cores. 
The slurm.conf content is:


SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/linear  #DEDICATED NODES
NodeName=acme[11-14,21-24] CPUs=16 Sockets=2 CoresPerSocket=8 ThreadsPerCore=1 State=UNKNOWN


I am running a simple hello-world parallel code. It is submitted as 
"sbatch --ntasks=X --tasks-per-node=Y myScript.sh". The problem is 
that, depending on the values of X and Y, Slurm computes a wrong 
task distribution and returns an error.
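
myScript.sh is nothing special; a minimal equivalent (a stand-in, not the exact script) is:

#!/bin/bash
# Hypothetical stand-in for the hello-world job script: srun simply
# inherits the task layout from the sbatch allocation.
srun ./helloWorld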


"
sbatch --ntasks=8 --tasks-per-node=2 myScript.sh
srun: Warning: can't honor --ntasks-per-node set to 2 which doesn't 
match the requested tasks 4 with the number of requested nodes 4. 
Ignoring --ntasks-per-node.

"
Note that I did not request 4 tasks but 8, and I did not request any 
nodes at all.

[slurm-dev] Re: Slurm license management question

2016-12-02 Thread Miguel Couceiro


Hi Loris,

I am interested in the script you mentioned. Could you please post the code?

Best regards,
Miguel


On 10/27/2016 08:22 AM, Loris Bennett wrote:

Baker D.J.  writes:


Hello,

Looking at the Slurm documentation, I see that it is possible to handle basic
license management (see http://slurm.schedmd.com/licenses.html). In other
words, software licenses can be treated as a resource; however, things appear
to be fairly rudimentary at the moment – at least that’s my impression.
We are used to doing license management in Moab, and while not having it
properly implemented is not the end of the world, it is not ideal.

One situation that we would like to be able to deal with is FlexLM
three-server redundancy; our Comsol licenses, for example, are served out in
this fashion. Is this something that Slurm can deal with, and, if so, how can
it be done? Any advice, including Slurm's shortcomings and/or future plans in
this respect, would be useful.

Best regards,

David

We have licenses, such as Intel compiler licenses, which can be used
both interactively outside the queuing system and within Slurm jobs.

We use a script which parses the output of the FlexLM manager and
modifies a reservation in which the licenses are defined.  This is run
as a cron job once a minute.  It's a bit of a kludge and obviously won't
work well if there is a lot of contention for licenses.
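
In outline it does something like this (a rough sketch, not the actual
script; the feature name, license-server address and reservation name
are placeholders):

#!/bin/bash
# Ask FlexLM how many seats of the feature are checked out right now.
# Placeholder feature/server; lmstat prints a line of the form
#   "Users of comp_c:  (Total of 10 licenses issued;  Total of 2 licenses in use)"
IN_USE=$(lmutil lmstat -c 28518@licserver -f comp_c | \
         awk '/Users of comp_c:/ {print $11}')
# Resize the standing reservation that holds the licenses, so that
# Slurm does not hand out seats that are in use outside the cluster.
scontrol update ReservationName=licenses Licenses=comp_c:${IN_USE}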

I can post the code if anyone is interested.

Cheers,

Loris



[slurm-dev] Re: Restrict access for a user group to certain nodes

2016-12-02 Thread Felix Willenborg

Thanks Brian and Magnus for your answers. The reservation feature was
exactly what I was looking for. Thank you very much!

Have a nice day,
Felix

Am 01.12.2016 um 16:14 schrieb Magnus Jonsson:
>
> Hi!
>
> You could either setup a partition for your tests with group
> restrictions or you can use the reservation feature depending on your
> exact use case.
>
> /Magnus
>
> On 2016-12-01 15:54, Felix Willenborg wrote:
>>
>> Dear everybody,
>>
>> I'd like to restrict submissions from a certain user group, or allow
>> only one certain user group to submit jobs to certain nodes. Does Slurm
>> offer groups which can handle such an occasion? Support for Linux user
>> groups would be preferred, because that would save setting up a new
>> user group environment.
>>
>> The intention is that only administrators can submit jobs to those
>> nodes, to perform tests which might otherwise be disturbed by users
>> submitting their jobs there. Various search engines didn't offer
>> answers to my question, which is why I'm writing to you here.
>>
>> Looking forward to some answers!
>>
>> Best,
>> Felix Willenborg
>>
>
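
For the archive, the two suggestions look roughly like this (an untested
sketch; node, group and user names are placeholders). A group-restricted
partition in slurm.conf:

  PartitionName=admintest Nodes=node[01-02] AllowGroups=admins State=UP

Or a standing reservation for the admin users:

  scontrol create reservation ReservationName=admintest \
      Users=alice,bob Nodes=node[01-02] \
      StartTime=now Duration=UNLIMITED Flags=IGNORE_JOBS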


[slurm-dev] Re: slurmdbd start error

2016-12-02 Thread Paddy Doyle

Hi Deborah,

On Thu, Dec 01, 2016 at 08:12:47AM -0800, Crocker, Deborah wrote:

> Has anyone seen this error - slurmdbd will not start:
> 
> [2016-12-01T10:07:54.946] debug:  slurmdbd: slurm_open_msg_conn to uahpc:6819: Connection refused
> [2016-12-01T10:07:54.946] error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
> 
> This was a running system and we just pushed out an update from 15.08.10 to
> 15.08.12.

As a wild guess, was the old daemon still running (and still listening on port
6819)?
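
Something like this on the dbd host would show it:

  # anything still listening on the slurmdbd port?
  netstat -tlnp | grep 6819
  # or look for a stale process directly
  pgrep -a slurmdbd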

Paddy

-- 
Paddy Doyle
Trinity Centre for High Performance Computing,
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
Phone: +353-1-896-3725
http://www.tchpc.tcd.ie/


[slurm-dev] Re: Install Slurm in a mic

2016-12-02 Thread Paddy Doyle

Hi Jose,

On Thu, Dec 01, 2016 at 09:36:16AM -0800, Jose Antonio wrote:

> 
> Hello everybody,
> 
> I would like to know if there is any way to install Munge and Slurm as
> daemons inside a Xeon Phi, as if it were an independent node like any
> other. The aim is to run native applications, as we are not interested
> in offload mode.

This old thread might be of help:

https://groups.google.com/forum/#!searchin/slurm-devel/xeon$20phi|sort:relevance/slurm-devel/0bnMLfV1qA8/de1a89yEl4sJ

Paddy

-- 
Paddy Doyle
Trinity Centre for High Performance Computing,
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
Phone: +353-1-896-3725
http://www.tchpc.tcd.ie/