Re: [slurm-users] options for ResumeProgram

2019-05-20 Thread Chris Samuel

On 20/5/19 2:11 pm, Brian Andrus wrote:

I know the argument passed to ResumeProgram is the node to be started, 
but is there any way to access job info from within that script?


I've no idea, but you could try dumping the environment with env (or 
setenv if you're using csh) from the script that Slurm is calling to do 
this work to see if anything is hiding there.
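
For what it's worth, a minimal debugging wrapper along these lines could log whatever 
environment slurmctld exports plus the jobs tied to the node; the log path and the 
squeue format string are illustrative assumptions, not anything Slurm mandates:

#!/bin/bash
# ResumeProgram debug sketch: Slurm passes the node list to resume as $1
NODES="$1"
LOG=/var/log/slurm/resume_debug.log    # assumed location

{
  echo "=== resume called for ${NODES} at $(date) ==="
  # dump whatever environment slurmctld passes to this script
  env | sort
  # list jobs associated with these nodes, with node and CPU counts
  squeue -w "${NODES}" -h -o "%A %D %C"
} >> "${LOG}" 2>&1

# ... then power on / provision the node(s) as usual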


All the best,
Chris
--
 Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA



Re: [slurm-users] Using cgroups to hide GPUs on a shared controller/node

2019-05-20 Thread Dave Evans
Do you have that resource handy? I looked into the cgroups documentation
but found very little in the way of tutorials on modifying the permissions.

On Mon, May 20, 2019 at 2:45 AM John Hearns  wrote:

> Two replies here.
> First off, for normal user logins you can direct them into a cgroup - I
> looked into this about a year ago and it was actually quite easy.
> As I remember there is a service or utility available which does just
> that. Of course the user cgroup would not have access to the GPU devices.
>
> Expanding on my theme, it is probably a good idea then to have all the
> system processes contained in a 'boot cpuset' - that is, at system boot time
> allocate a small number of cores to the system daemons, Slurm processes
> and probably the user login sessions, thus freeing up the other CPUs
> exclusively for batch jobs.
>
> Also you could try simply setting CUDA_VISIBLE_DEVICES to an empty value in
> one of the system-wide login scripts.
>
> On Mon, 20 May 2019 at 08:38, Nathan Harper 
> wrote:
>
>> This doesn't directly answer your question, but in Feb last year on the
>> ML there was a discussion about limiting user resources on login node
>> (Stopping compute usage on login nodes). Some of the suggestions
>> included the use of cgroups to do so, and it's possible that those methods
>> could be extended to limit access to GPUs, so it might be worth looking
>> into.
>>
>> On Sat, 18 May 2019 at 00:28, Dave Evans  wrote:
>>
>>>
>>> We are using a single system "cluster" and want some control of fair-use
>>> with the GPUs. The users are not supposed to be able to use the GPUs until
>>> they have allocated the resources through slurm. We have no head node, so
>>> slurmctld, slurmdbd, and slurmd are all run on the same system.
>>>
>>> I have a configuration working now such that the GPUs can be scheduled
>>> and allocated.
>>> However logging into the system before allocating GPUs gives full access
>>> to all of them.
>>>
>>> I would like to configure slurm cgroups to disable access to GPUs until
>>> they have been allocated.
>>>
>>> On first login, I get:
>>> nvidia-smi -q | grep UUID
>>> GPU UUID:
>>> GPU-6076ce0a-bc03-a53c-6616-0fc727801c27
>>> GPU UUID:
>>> GPU-5620ec48-7d76-0398-9cc1-f1fa661274f3
>>> GPU UUID:
>>> GPU-176d0514-0cf0-df71-e298-72d15f6dcd7f
>>> GPU UUID:
>>> GPU-af03c80f-6834-cb8c-3133-2f645975f330
>>> GPU UUID:
>>> GPU-ef10d039-a432-1ac1-84cf-3bb79561c0d3
>>> GPU UUID:
>>> GPU-38168510-c356-33c9-7189-4e74b5a1d333
>>> GPU UUID:
>>> GPU-3428f78d-ae91-9a74-bcd6-8e301c108156
>>> GPU UUID:
>>> GPU-c0a831c0-78d6-44ec-30dd-9ef5874059a5
>>>
>>>
>>> And running from the queue:
>>> srun -N 1 --gres=gpu:2 nvidia-smi -q | grep UUID
>>> GPU UUID:
>>> GPU-6076ce0a-bc03-a53c-6616-0fc727801c27
>>> GPU UUID:
>>> GPU-5620ec48-7d76-0398-9cc1-f1fa661274f3
>>>
>>>
>>> Pastes of my config files are:
>>> ## slurm.conf ##
>>> https://pastebin.com/UxP67cA8
>>>
>>>
>>> ## cgroup.conf ##
>>> CgroupAutomount=yes
>>> CgroupReleaseAgentDir="/etc/slurm/cgroup"
>>>
>>> ConstrainCores=yes
>>> ConstrainDevices=yes
>>> ConstrainRAMSpace=yes
>>> #TaskAffinity=yes
>>>
>>> ## cgroup_allowed_devices_file.conf ##
>>> /dev/null
>>> /dev/urandom
>>> /dev/zero
>>> /dev/sda*
>>> /dev/cpu/*/*
>>> /dev/pts/*
>>> /dev/nvidia*
>>>
>>
>>
>> --
>> Nathan Harper // IT Systems Lead
>>
>> e: nathan.har...@cfms.org.uk   t: 0117 906 1104   m: 0787 551 0891
>> w: www.cfms.org.uk
>> CFMS Services Ltd // Bristol & Bath Science Park // Dirac Crescent //
>> Emersons Green // Bristol // BS16 7FR
>>
>> CFMS Services Ltd is registered in England and Wales No 05742022 - a
>> subsidiary of CFMS Ltd
>> CFMS Services Ltd registered office // 43 Queens Square // Bristol //
>> BS1 4QP
>>
>


[slurm-users] options for ResumeProgram

2019-05-20 Thread Brian Andrus

All,

I know the argument passed to ResumeProgram is the node to be started, 
but is there any way to access job info from within that script?


In particular, the number of nodes and cores actually requested.


Brian Andrus




Re: [slurm-users] MaxTRESRunMinsPU not yet enabled - similar options?

2019-05-20 Thread Fulcomer, Samuel
On Mon, May 20, 2019 at 2:59 PM  wrote:

>
>
>
> I did test setting GrpTRESRunMins=cpu=N for each user + account
> association, and that does appear to work. Does anyone know of any other
> solutions to this issue?


No. Your solution is what we currently do. A per-user ("...PU") variant would
be a nice, tidy addition for the QOS entity.
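
For reference, applying the per-association limit can be scripted along these
lines (the user, account and cpu-minute value are purely illustrative):

# cap the running cpu-minutes for one user/account association
sacctmgr modify user someuser where account=acct1 \
    set GrpTRESRunMins=cpu=2880000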

regards,
s

>
> Thanks,
> Jesse Stroik
>
>


Re: [slurm-users] Access/permission denied

2019-05-20 Thread John Hearns
Why are you sshing into the compute node compute-0-2?
On the head node named rocks7:

srun -c 1 --partition RUBY --account y8 --mem=1G xclock
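
If it still fails from the head node, a quick sanity check of the association
that slurmctld will consult could look something like this (the format list is
just an example):

sacctmgr show assoc where user=kouhikamali3 account=y8 \
    format=cluster,account,user%20,partition,grptres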

On Mon, 20 May 2019 at 16:07, Mahmood Naderan  wrote:

> Hi
> Although the proper configuration has been defined as below,
>
> [root@rocks7 software]# grep RUBY /etc/slurm/parts
> PartitionName=RUBY AllowAccounts=y4,y8 Nodes=compute-0-[1-4]
> [root@rocks7 software]# sacctmgr list association
> format=account,"user%20",partition,grptres,maxwall | grep kouhikamali3
>  local kouhikamali3cpu=16,mem=1+
> y8 kouhikamali3   ruby cpu=16,mem=1+
> [root@rocks7 software]# systemctl restart slurmd
> [root@rocks7 software]# systemctl restart slurmctld
> [root@rocks7 software]#
>
>
>
> The user is not able to run srun on a specified node. See:
>
>
> [kouhikamali3@rocks7 ~]$ ssh -Y compute-0-2
> ...
> [kouhikamali3@compute-0-2 ~]$ srun -c 1 --partition RUBY --account y8
> --mem=1G xclock
> srun: error: Unable to allocate resources: Access/permission denied
>
>
> Any thoughts?
>
> Regards,
> Mahmood
>
>
>


[slurm-users] Access/permission denied

2019-05-20 Thread Mahmood Naderan
Hi
Although the proper configuration has been defined as below,

[root@rocks7 software]# grep RUBY /etc/slurm/parts
PartitionName=RUBY AllowAccounts=y4,y8 Nodes=compute-0-[1-4]
[root@rocks7 software]# sacctmgr list association
format=account,"user%20",partition,grptres,maxwall | grep kouhikamali3
 local kouhikamali3cpu=16,mem=1+
y8 kouhikamali3   ruby cpu=16,mem=1+
[root@rocks7 software]# systemctl restart slurmd
[root@rocks7 software]# systemctl restart slurmctld
[root@rocks7 software]#



The user is not able to run srun on a specified node. See:


[kouhikamali3@rocks7 ~]$ ssh -Y compute-0-2
...
[kouhikamali3@compute-0-2 ~]$ srun -c 1 --partition RUBY --account y8
--mem=1G xclock
srun: error: Unable to allocate resources: Access/permission denied


Any thoughts?

Regards,
Mahmood


Re: [slurm-users] Using cgroups to hide GPUs on a shared controller/node

2019-05-20 Thread John Hearns
Two replies here.
First off, for normal user logins you can direct them into a cgroup - I
looked into this about a year ago and it was actually quite easy.
As I remember there is a service or utility available which does just that.
Of course the user cgroup would not have access to the GPU devices.

Expanding on my theme, it is probably a good idea then to have all the
system processes contained in a 'boot cpuset' - that is, at system boot time
allocate a small number of cores to the system daemons, Slurm processes
and probably the user login sessions, thus freeing up the other CPUs
exclusively for batch jobs.
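
One way to approximate such a boot cpuset on a systemd host is sketched below:
pin everything systemd starts onto a couple of housekeeping cores, then widen
the mask again for slurmd so its task/cgroup plugins can hand the remaining
cores to jobs (the core numbers are purely illustrative):

# /etc/systemd/system.conf  -- pin PID 1 and everything it spawns
[Manager]
CPUAffinity=0 1

# /etc/systemd/system/slurmd.service.d/cpuaffinity.conf
# per-unit override so slurmd and the job steps it launches may use all cores
[Service]
CPUAffinity=0-63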

Also you could try simply setting CUDA_VISIBLE_DEVICES to an empty value in
one of the system-wide login scripts.
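
A minimal sketch of that, assuming a profile.d drop-in (the path and the guard
on SLURM_JOB_ID are just one way to do it); note it only hides the GPUs from
well-behaved tools, since a user can simply unset the variable, and Slurm job
steps should still receive their own CUDA_VISIBLE_DEVICES from the gres plugin:

# /etc/profile.d/hide-gpus.sh
# interactive logins outside a Slurm allocation see no GPUs
if [ -z "$SLURM_JOB_ID" ]; then
    export CUDA_VISIBLE_DEVICES=""
fi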

On Mon, 20 May 2019 at 08:38, Nathan Harper 
wrote:

> This doesn't directly answer your question, but in Feb last year on the ML
> there was a discussion about limiting user resources on login node
> (Stopping compute usage on login nodes). Some of the suggestions
> included the use of cgroups to do so, and it's possible that those methods
> could be extended to limit access to GPUs, so it might be worth looking
> into.
>
> On Sat, 18 May 2019 at 00:28, Dave Evans  wrote:
>
>>
>> We are using a single system "cluster" and want some control of fair-use
>> with the GPUs. The users are not supposed to be able to use the GPUs until
>> they have allocated the resources through slurm. We have no head node, so
>> slurmctld, slurmdbd, and slurmd are all run on the same system.
>>
>> I have a configuration working now such that the GPUs can be scheduled
>> and allocated.
>> However logging into the system before allocating GPUs gives full access
>> to all of them.
>>
>> I would like to configure slurm cgroups to disable access to GPUs until
>> they have been allocated.
>>
>> On first login, I get:
>> nvidia-smi -q | grep UUID
>> GPU UUID:
>> GPU-6076ce0a-bc03-a53c-6616-0fc727801c27
>> GPU UUID:
>> GPU-5620ec48-7d76-0398-9cc1-f1fa661274f3
>> GPU UUID:
>> GPU-176d0514-0cf0-df71-e298-72d15f6dcd7f
>> GPU UUID:
>> GPU-af03c80f-6834-cb8c-3133-2f645975f330
>> GPU UUID:
>> GPU-ef10d039-a432-1ac1-84cf-3bb79561c0d3
>> GPU UUID:
>> GPU-38168510-c356-33c9-7189-4e74b5a1d333
>> GPU UUID:
>> GPU-3428f78d-ae91-9a74-bcd6-8e301c108156
>> GPU UUID:
>> GPU-c0a831c0-78d6-44ec-30dd-9ef5874059a5
>>
>>
>> And running from the queue:
>> srun -N 1 --gres=gpu:2 nvidia-smi -q | grep UUID
>> GPU UUID:
>> GPU-6076ce0a-bc03-a53c-6616-0fc727801c27
>> GPU UUID:
>> GPU-5620ec48-7d76-0398-9cc1-f1fa661274f3
>>
>>
>> Pastes of my config files are:
>> ## slurm.conf ##
>> https://pastebin.com/UxP67cA8
>>
>>
>> ## cgroup.conf ##
>> CgroupAutomount=yes
>> CgroupReleaseAgentDir="/etc/slurm/cgroup"
>>
>> ConstrainCores=yes
>> ConstrainDevices=yes
>> ConstrainRAMSpace=yes
>> #TaskAffinity=yes
>>
>> ## cgroup_allowed_devices_file.conf ##
>> /dev/null
>> /dev/urandom
>> /dev/zero
>> /dev/sda*
>> /dev/cpu/*/*
>> /dev/pts/*
>> /dev/nvidia*
>>
>
>
> --
> Nathan Harper // IT Systems Lead
>
> e: nathan.har...@cfms.org.uk   t: 0117 906 1104   m: 0787 551 0891
> w: www.cfms.org.uk
> CFMS Services Ltd // Bristol & Bath Science Park // Dirac Crescent //
> Emersons Green // Bristol // BS16 7FR
>
> CFMS Services Ltd is registered in England and Wales No 05742022 - a
> subsidiary of CFMS Ltd
> CFMS Services Ltd registered office // 43 Queens Square // Bristol // BS1 4QP
>


Re: [slurm-users] Using cgroups to hide GPUs on a shared controller/node

2019-05-20 Thread Nathan Harper
This doesn't directly answer your question, but in Feb last year on the ML
there was a discussion about limiting user resources on login node
(Stopping compute usage on login nodes). Some of the suggestions
included the use of cgroups to do so, and it's possible that those methods
could be extended to limit access to GPUs, so it might be worth looking
into.
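
As an aside, on a systemd-based login node one way to apply such cgroup limits
to every login session is a drop-in on the user slices; the figures below are
purely illustrative, and constraining GPU device access would still need its
own handling on top of this:

# /etc/systemd/system/user-.slice.d/50-limits.conf
# applies to every user-UID.slice, i.e. all interactive sessions
[Slice]
CPUQuota=200%
MemoryMax=16G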

On Sat, 18 May 2019 at 00:28, Dave Evans  wrote:

>
> We are using a single system "cluster" and want some control of fair-use
> with the GPUs. The users are not supposed to be able to use the GPUs until
> they have allocated the resources through slurm. We have no head node, so
> slurmctld, slurmdbd, and slurmd are all run on the same system.
>
> I have a configuration working now such that the GPUs can be scheduled and
> allocated.
> However logging into the system before allocating GPUs gives full access
> to all of them.
>
> I would like to configure slurm cgroups to disable access to GPUs until
> they have been allocated.
>
> On first login, I get:
> nvidia-smi -q | grep UUID
> GPU UUID:
> GPU-6076ce0a-bc03-a53c-6616-0fc727801c27
> GPU UUID:
> GPU-5620ec48-7d76-0398-9cc1-f1fa661274f3
> GPU UUID:
> GPU-176d0514-0cf0-df71-e298-72d15f6dcd7f
> GPU UUID:
> GPU-af03c80f-6834-cb8c-3133-2f645975f330
> GPU UUID:
> GPU-ef10d039-a432-1ac1-84cf-3bb79561c0d3
> GPU UUID:
> GPU-38168510-c356-33c9-7189-4e74b5a1d333
> GPU UUID:
> GPU-3428f78d-ae91-9a74-bcd6-8e301c108156
> GPU UUID:
> GPU-c0a831c0-78d6-44ec-30dd-9ef5874059a5
>
>
> And running from the queue:
> srun -N 1 --gres=gpu:2 nvidia-smi -q | grep UUID
> GPU UUID:
> GPU-6076ce0a-bc03-a53c-6616-0fc727801c27
> GPU UUID:
> GPU-5620ec48-7d76-0398-9cc1-f1fa661274f3
>
>
> Pastes of my config files are:
> ## slurm.conf ##
> https://pastebin.com/UxP67cA8
>
>
> ## cgroup.conf ##
> CgroupAutomount=yes
> CgroupReleaseAgentDir="/etc/slurm/cgroup"
>
> ConstrainCores=yes
> ConstrainDevices=yes
> ConstrainRAMSpace=yes
> #TaskAffinity=yes
>
> ## cgroup_allowed_devices_file.conf ##
> /dev/null
> /dev/urandom
> /dev/zero
> /dev/sda*
> /dev/cpu/*/*
> /dev/pts/*
> /dev/nvidia*
>


-- 
Nathan Harper // IT Systems Lead

e: nathan.har...@cfms.org.uk   t: 0117 906 1104   m: 0787 551 0891
w: www.cfms.org.uk
CFMS Services Ltd // Bristol & Bath Science Park // Dirac Crescent //
Emersons Green // Bristol // BS16 7FR

CFMS Services Ltd is registered in England and Wales No 05742022 - a
subsidiary of CFMS Ltd
CFMS Services Ltd registered office // 43 Queens Square // Bristol // BS1 4QP