Re: [slurm-users] [EXT] Re: systemctl enable slurmd.service Failed to execute operation: No such file or directory

2022-01-31 Thread Sean Crosby
Did you build Slurm yourself from source? If so, the node you build on needs the
munge development package installed (munge-devel on EL systems, libmunge-dev on
Debian).

You then need to set up munge with a shared munge key between the nodes, and 
have the munge daemon running.

This is all detailed on Ole's wiki which was linked previously - 
https://wiki.fysik.dtu.dk/niflheim/Slurm_installation
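
A concrete sketch of the above (paths assume the default /usr/local prefix of a
source build, package names are for CentOS 7, and node01 is a placeholder -
adjust to your setup):

ls /usr/local/lib/slurm/cred_munge.so      # the plugin slurmctld says it cannot find
yum install munge munge-libs munge-devel   # needed on the build host *before* ./configure && make
/usr/sbin/create-munge-key                 # run once, then copy the same key to every node
scp /etc/munge/munge.key node01:/etc/munge/
chown munge:munge /etc/munge/munge.key && chmod 400 /etc/munge/munge.key
systemctl enable munge && systemctl start munge
munge -n | ssh node01 unmunge              # quick cross-node test of munge

If cred_munge.so is missing, rebuild and reinstall Slurm after installing
munge-devel.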

Sean

From: slurm-users  on behalf of Nousheen 

Sent: Tuesday, 1 February 2022 15:56
To: Slurm User Community List 
Subject: [EXT] Re: [slurm-users] systemctl enable slurmd.service Failed to 
execute operation: No such file or directory

Dear Ole and Hermann,

I have reinstalled slurm from scratch now following this link:

The error remains the same. Kindly guide me on where I will find this cred/munge
plugin. Please help me resolve this issue.

[root@exxact slurm]# slurmd -C
NodeName=exxact CPUs=12 Boards=1 SocketsPerBoard=1 CoresPerSocket=6 
ThreadsPerCore=2 RealMemory=31889
UpTime=0-22:06:45
[root@exxact slurm]# systemctl enable slurmctld.service
[root@exxact slurm]# systemctl start slurmctld.service
[root@exxact slurm]# systemctl status slurmctld.service
● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled; vendor 
preset: disabled)
   Active: failed (Result: exit-code) since Tue 2022-02-01 09:46:20 PKT; 8s ago
  Process: 27530 ExecStart=/usr/local/sbin/slurmctld -D -s $SLURMCTLD_OPTIONS 
(code=exited, status=1/FAILURE)
 Main PID: 27530 (code=exited, status=1/FAILURE)

Feb 01 09:46:20 exxact systemd[1]: Started Slurm controller daemon.
Feb 01 09:46:20 exxact systemd[1]: slurmctld.service: main process exited, ...RE
Feb 01 09:46:20 exxact systemd[1]: Unit slurmctld.service entered failed state.
Feb 01 09:46:20 exxact systemd[1]: slurmctld.service failed.


[root@exxact slurm]# /usr/local/sbin/slurmctld -D
slurmctld: slurmctld version 21.08.5 started on cluster cluster194
slurmctld: error: Couldn't find the specified plugin name for cred/munge 
looking at all files
slurmctld: error: cannot find cred plugin for cred/munge
slurmctld: error: cannot create cred context for cred/munge
slurmctld: fatal: slurm_cred_creator_ctx_create((null)): Operation not permitted



Best Regards,
Nousheen Parvaiz

On Tue, Feb 1, 2022 at 9:06 AM Nousheen <nousheenparv...@gmail.com> wrote:
Dear Ole,

Thank you for your response.
I am doing it again using your suggested link.

Best Regards,
Nousheen Parvaiz



On Mon, Jan 31, 2022 at 2:07 PM Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk> wrote:
Hi Nousheen,

I recommend you again to follow the steps for installing Slurm on a CentOS
7 cluster:
https://wiki.fysik.dtu.dk/niflheim/Slurm_installation

Maybe you will need to start installation from scratch, but the steps are
guaranteed to work if followed correctly.

IHTH,
Ole

On 1/31/22 06:23, Nousheen wrote:
> The same error shows up on compute node which is as follows:
>
> [root@c103008 ~]# systemctl enable slurmd.service
> [root@c103008 ~]# systemctl start slurmd.service
> [root@c103008 ~]# systemctl status slurmd.service
> ● slurmd.service - Slurm node daemon
> Loaded: loaded (/etc/systemd/system/slurmd.service; enabled; vendor
> preset: disabled)
> Active: failed (Result: exit-code) since Mon 2022-01-31 00:22:42 EST;
> 2s ago
>Process: 11505 ExecStart=/usr/local/sbin/slurmd -D -s $SLURMD_OPTIONS
> (code=exited, status=203/EXEC)
>   Main PID: 11505 (code=exited, status=203/EXEC)
>
> Jan 31 00:22:42 c103008 systemd[1]: Started Slurm node daemon.
> Jan 31 00:22:42 c103008 systemd[1]: slurmd.service: main process exited,
> code=exited, status=203/EXEC
> Jan 31 00:22:42 c103008 systemd[1]: Unit slurmd.service entered failed state.
> Jan 31 00:22:42 c103008 systemd[1]: slurmd.service failed.
>
>
> Best Regards,
> Nousheen Parvaiz
>
>
>
> On Mon, Jan 31, 2022 at 10:08 AM Nousheen 
> mailto:nousheenparv...@gmail.com>
> >> wrote:
>
> Dear Jeffrey,
>
> Thank you for your response. I have followed the steps as instructed.
> After copying the files to their respective locations "systemctl
> status slurmctld.service" command gives me an error as follows:
>
> (base) [nousheen@exxact system]$ systemctl daemon-reload
> (base) [nousheen@exxact system]$ systemctl enable slurmctld.service
> (base) [nousheen@exxact system]$ systemctl start slurmctld.service
> (base) [nousheen@exxact system]$ systemctl status slurmctld.service
> ● slurmctld.service - Slurm controller 

Re: [slurm-users] Stopping new jobs but letting old ones end

2022-01-31 Thread Ole Holm Nielsen

One thing to be aware of when setting partition states to down:

* Setting partition state=down will be reset if slurmctld is restarted.

Read the slurmctld man-page under the -R parameter.  So it's better not to 
restart slurmctld during the downtime.
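
For example (the partition name here is only an illustration):

scontrol update PartitionName=debug State=DOWN
sinfo --format="%R %a"    # the AVAIL column should now show "down"

and if slurmctld does get restarted during the window, re-check with sinfo and
mark the partitions down again.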


/Ole

On 2/1/22 08:11, Ole Holm Nielsen wrote:
Login nodes being down doesn't affect Slurm jobs at all (except if you run 
slurmctld/slurmdbd on the login node ;-)


To stop new jobs from being scheduled for running, mark all partitions 
down.  This is useful when recovering the cluster from a power or cooling 
downtime, for example.


I wrote a handy little script "schedjobs down" available from
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/jobs
This loops over all partitions in the cluster and marks them down.  When 
the cluster is OK again, run "schedjobs up".


/Ole

On 2/1/22 07:14, Sid Young wrote:
Brian / Christopher, that looks like a good process, thanks guys, I will 
do some testing and let you know.


if I mark a partition down and it has running jobs, what happens to 
those jobs, do they keep running?



Sid Young
W: https://off-grid-engineering.com 
W: (personal) https://sidyoung.com/ 
W: (personal) https://z900collector.wordpress.com/ 




On Tue, Feb 1, 2022 at 3:27 PM Brian Andrus > wrote:


    One possibility:

    Sounds like your concern is folks with interactive jobs from the login
    node that are running under screen/tmux.

    That being the case, you need running jobs to end and not allow new
    users to start tmux sessions.

    Definitely doing 'scontrol update state=down partition=' for each
    partition. Also:

    touch /etc/nologin

    That will prevent new logins.

    Send a message to all active folks

    wall "system going down at XX:XX, please end your sessions"

    Then wait for folks to drain off your login node and do your stuff.

    When done, remove the /etc/nologin file and folks will be able to
    login again.

    Brian Andrus

    On 1/31/2022 9:18 PM, Sid Young wrote:




    Sid Young
    W: https://off-grid-engineering.com 
    W: (personal) https://sidyoung.com/ 
    W: (personal) https://z900collector.wordpress.com/
    


    On Tue, Feb 1, 2022 at 3:02 PM Christopher Samuel mailto:ch...@csamuel.org>> wrote:

    On 1/31/22 4:41 pm, Sid Young wrote:

    > I need to replace a faulty DIMM chip in our login node so I
    need to stop
    > new jobs being kicked off while letting the old ones end.
    >
    > I thought I would just set all nodes to drain to stop new jobs
    from
    > being kicked off...

    That would basically be the way, but is there any reason why
    compute
    jobs shouldn't start whilst the login node is down?


    My concern was to keep the running jobs going and stop new jobs, so
    when the last running job ends,
     I could reboot the login node knowing that any terminal windows
    "screen"/"tmux" sessions would effectively
    have ended as the job(s) had now ended

    I'm not sure if there was an accepted procedure or best practice way
    to tackle shutting down the Login node for this use case.

    On the bright side I am down to two jobs left so any day now :)





Re: [slurm-users] Stopping new jobs but letting old ones end

2022-01-31 Thread Ole Holm Nielsen
Login nodes being down doesn't affect Slurm jobs at all (except if you run 
slurmctld/slurmdbd on the login node ;-)


To stop new jobs from being scheduled for running, mark all partitions 
down.  This is useful when recovering the cluster from a power or cooling 
downtime, for example.


I wrote a handy little script "schedjobs down" available from
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/jobs
This loops over all partitions in the cluster and marks them down.  When 
the cluster is OK again, run "schedjobs up".
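
The idea is roughly this - not the actual script, just a minimal sketch of the
same loop:

for p in $(sinfo --noheader --format="%R" | sort -u); do
    scontrol update PartitionName="$p" State=DOWN
done
# and the same loop with State=UP when the cluster is OK again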


/Ole

On 2/1/22 07:14, Sid Young wrote:
Brian / Christopher, that looks like a good process, thanks guys, I will 
do some testing and let you know.


if I mark a partition down and it has running jobs, what happens to those 
jobs, do they keep running?



Sid Young
W: https://off-grid-engineering.com 
W: (personal) https://sidyoung.com/ 
W: (personal) https://z900collector.wordpress.com/ 




On Tue, Feb 1, 2022 at 3:27 PM Brian Andrus > wrote:


One possibility:

Sounds like your concern is folks with interactive jobs from the login
node that are running under screen/tmux.

That being the case, you need running jobs to end and not allow new
users to start tmux sessions.

Definitely doing 'scontrol update state=down partition=' for each
partition. Also:

touch /etc/nologin

That will prevent new logins.

Send a message to all active folks

wall "system going down at XX:XX, please end your sessions"

Then wait for folks to drain off your login node and do your stuff.

When done, remove the /etc/nologin file and folks will be able to
login again.

Brian Andrus

On 1/31/2022 9:18 PM, Sid Young wrote:




Sid Young
W: https://off-grid-engineering.com 
W: (personal) https://sidyoung.com/ 
W: (personal) https://z900collector.wordpress.com/



On Tue, Feb 1, 2022 at 3:02 PM Christopher Samuel mailto:ch...@csamuel.org>> wrote:

On 1/31/22 4:41 pm, Sid Young wrote:

> I need to replace a faulty DIMM chip in our login node so I
need to stop
> new jobs being kicked off while letting the old ones end.
>
> I thought I would just set all nodes to drain to stop new jobs
from
> being kicked off...

That would basically be the way, but is there any reason why
compute
jobs shouldn't start whilst the login node is down?


My concern was to keep the running jobs going and stop new jobs, so
when the last running job ends,
 I could reboot the login node knowing that any terminal windows
"screen"/"tmux" sessions would effectively
have ended as the job(s) had now ended

I'm not sure if there was an accepted procedure or best practice way
to tackle shutting down the Login node for this use case.

On the bright side I am down to two jobs left so any day now :)




Re: [slurm-users] Stopping new jobs but letting old ones end

2022-01-31 Thread Sid Young
Brian / Christopher, that looks like a good process, thanks guys, I will do
some testing and let you know.

if I mark a partition down and it has running jobs, what happens to those
jobs, do they keep running?


Sid Young
W: https://off-grid-engineering.com
W: (personal) https://sidyoung.com/
W: (personal) https://z900collector.wordpress.com/


On Tue, Feb 1, 2022 at 3:27 PM Brian Andrus  wrote:

> One possibility:
>
> Sounds like your concern is folks with interactive jobs from the login
> node that are running under screen/tmux.
>
> That being the case, you need running jobs to end and not allow new users
> to start tmux sessions.
>
> Definitely doing 'scontrol update state=down partition=' for each
> partition. Also:
>
> touch /etc/nologin
>
> That will prevent new logins.
>
> Send a message to all active folks
>
> wall "system going down at XX:XX, please end your sessions"
>
> Then wait for folks to drain off your login node and do your stuff.
>
> When done, remove the /etc/nologin file and folks will be able to login
> again.
>
> Brian Andrus
> On 1/31/2022 9:18 PM, Sid Young wrote:
>
>
>
>
> Sid Young
> W: https://off-grid-engineering.com
> W: (personal) https://sidyoung.com/
> W: (personal) https://z900collector.wordpress.com/
>
>
> On Tue, Feb 1, 2022 at 3:02 PM Christopher Samuel 
> wrote:
>
>> On 1/31/22 4:41 pm, Sid Young wrote:
>>
>> > I need to replace a faulty DIMM chip in our login node so I need to
>> stop
>> > new jobs being kicked off while letting the old ones end.
>> >
>> > I thought I would just set all nodes to drain to stop new jobs from
>> > being kicked off...
>>
>> That would basically be the way, but is there any reason why compute
>> jobs shouldn't start whilst the login node is down?
>>
>
> My concern was to keep the running jobs going and stop new jobs, so when
> the last running job ends,
>  I could reboot the login node knowing that any terminal windows
> "screen"/"tmux" sessions would effectively
> have ended as the job(s) had now ended
>
> I'm not sure if there was an accepted procedure or best practice way to
> tackle shutting down the Login node for this use case.
>
> On the bright side I am down to two jobs left so any day now :)
>
> Sid
>
>
>
>
>> All the best,
>> Chris
>> --
>>Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA
>>
>>


Re: [slurm-users] Stopping new jobs but letting old ones end

2022-01-31 Thread Christopher Samuel

On 1/31/22 9:25 pm, Brian Andrus wrote:


touch /etc/nologin

That will prevent new logins.


It's also useful that if you put a message in /etc/nologin then users 
who are trying to login will get that message before being denied.
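
For example something like this (the wording and time are of course just
placeholders):

echo "Login node down for maintenance until 15:00; running jobs are not affected." > /etc/nologin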


All the best,
Chris
--
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA



Re: [slurm-users] Stopping new jobs but letting old ones end

2022-01-31 Thread Brian Andrus

One possibility:

Sounds like your concern is folks with interactive jobs from the login 
node that are running under screen/tmux.


That being the case, you need running jobs to end and not allow new 
users to start tmux sessions.


Definitely doing 'scontrol update state=down partition=' for each 
partition. Also:


touch /etc/nologin

That will prevent new logins.

Send a message to all active folks

wall "system going down at XX:XX, please end your sessions"

Then wait for folks to drain off your login node and do your stuff.

When done, remove the /etc/nologin file and folks will be able to login 
again.


Brian Andrus

On 1/31/2022 9:18 PM, Sid Young wrote:




Sid Young
W: https://off-grid-engineering.com
W: (personal) https://sidyoung.com/
W: (personal) https://z900collector.wordpress.com/


On Tue, Feb 1, 2022 at 3:02 PM Christopher Samuel  
wrote:


On 1/31/22 4:41 pm, Sid Young wrote:

> I need to replace a faulty DIMM chip in our login node so I need
to stop
> new jobs being kicked off while letting the old ones end.
>
> I thought I would just set all nodes to drain to stop new jobs from
> being kicked off...

That would basically be the way, but is there any reason why compute
jobs shouldn't start whilst the login node is down?


My concern was to keep the running jobs going and stop new jobs, so 
when the last running job ends,
 I could reboot the login node knowing that any terminal windows 
"screen"/"tmux" sessions would effectively

have ended as the job(s) had now ended

I'm not sure if there was an accepted procedure or best practice way 
to tackle shutting down the Login node for this use case.


On the bright side I am down to two jobs left so any day now :)

Sid




All the best,
Chris
-- 
   Chris Samuel  : http://www.csamuel.org/ :  Berkeley, CA, USA


Re: [slurm-users] Stopping new jobs but letting old ones end

2022-01-31 Thread Sid Young
Sid Young
W: https://off-grid-engineering.com
W: (personal) https://sidyoung.com/
W: (personal) https://z900collector.wordpress.com/


On Tue, Feb 1, 2022 at 3:02 PM Christopher Samuel  wrote:

> On 1/31/22 4:41 pm, Sid Young wrote:
>
> > I need to replace a faulty DIMM chip in our login node so I need to stop
> > new jobs being kicked off while letting the old ones end.
> >
> > I thought I would just set all nodes to drain to stop new jobs from
> > being kicked off...
>
> That would basically be the way, but is there any reason why compute
> jobs shouldn't start whilst the login node is down?
>

My concern was to keep the running jobs going and stop new jobs, so when
the last running job ends,
 I could reboot the login node knowing that any terminal windows
"screen"/"tmux" sessions would effectively
have ended as the job(s) had now ended

I'm not sure if there was an accepted procedure or best practice way to
tackle shutting down the Login node for this use case.

On the bright side I am down to two jobs left so any day now :)

Sid




> All the best,
> Chris
> --
>Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA
>
>


Re: [slurm-users] Stopping new jobs but letting old ones end

2022-01-31 Thread Christopher Samuel

On 1/31/22 9:00 pm, Christopher Samuel wrote:


That would basically be the way


Thinking further on this a better way would be to mark your partitions 
down, as it's likely you've got fewer partitions than compute nodes.


All the best,
Chris
--
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA



Re: [slurm-users] Stopping new jobs but letting old ones end

2022-01-31 Thread Christopher Samuel

On 1/31/22 4:41 pm, Sid Young wrote:

I need to replace a faulty DIMM chip in our login node so I need to stop 
new jobs being kicked off while letting the old ones end.


I thought I would just set all nodes to drain to stop new jobs from 
being kicked off...


That would basically be the way, but is there any reason why compute 
jobs shouldn't start whilst the login node is down?


All the best,
Chris
--
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA



Re: [slurm-users] systemctl enable slurmd.service Failed to execute operation: No such file or directory

2022-01-31 Thread Nousheen
Dear Ole and Hermann,

I have reinstalled slurm from scratch now following this link:

The error remains the same. Kindly guide me on where I will find this
cred/munge plugin. Please help me resolve this issue.

[root@exxact slurm]# slurmd -C
NodeName=exxact CPUs=12 Boards=1 SocketsPerBoard=1 CoresPerSocket=6
ThreadsPerCore=2 RealMemory=31889
UpTime=0-22:06:45
[root@exxact slurm]# systemctl enable slurmctld.service
[root@exxact slurm]# systemctl start slurmctld.service
[root@exxact slurm]# systemctl status slurmctld.service
● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled; vendor
preset: disabled)
   Active: failed (Result: exit-code) since Tue 2022-02-01 09:46:20 PKT; 8s
ago
  Process: 27530 ExecStart=/usr/local/sbin/slurmctld -D -s
$SLURMCTLD_OPTIONS (code=exited, status=1/FAILURE)
 Main PID: 27530 (code=exited, status=1/FAILURE)

Feb 01 09:46:20 exxact systemd[1]: Started Slurm controller daemon.
Feb 01 09:46:20 exxact systemd[1]: slurmctld.service: main process exited,
...RE
Feb 01 09:46:20 exxact systemd[1]: Unit slurmctld.service entered failed
state.
Feb 01 09:46:20 exxact systemd[1]: slurmctld.service failed.


[root@exxact slurm]# /usr/local/sbin/slurmctld -D
slurmctld: slurmctld version 21.08.5 started on cluster cluster194
slurmctld: error: Couldn't find the specified plugin name for cred/munge
looking at all files
slurmctld: error: cannot find cred plugin for cred/munge
slurmctld: error: cannot create cred context for cred/munge
slurmctld: fatal: slurm_cred_creator_ctx_create((null)): Operation not
permitted



Best Regards,
Nousheen Parvaiz

On Tue, Feb 1, 2022 at 9:06 AM Nousheen  wrote:

> Dear Ole,
>
> Thank you for your response.
> I am doing it again using your suggested link.
>
> Best Regards,
> Nousheen Parvaiz
>
>
>
> On Mon, Jan 31, 2022 at 2:07 PM Ole Holm Nielsen <
> ole.h.niel...@fysik.dtu.dk> wrote:
>
>> Hi Nousheen,
>>
>> I recommend you again to follow the steps for installing Slurm on a
>> CentOS
>> 7 cluster:
>> https://wiki.fysik.dtu.dk/niflheim/Slurm_installation
>>
>> Maybe you will need to start installation from scratch, but the steps are
>> guaranteed to work if followed correctly.
>>
>> IHTH,
>> Ole
>>
>> On 1/31/22 06:23, Nousheen wrote:
>> > The same error shows up on compute node which is as follows:
>> >
>> > [root@c103008 ~]# systemctl enable slurmd.service
>> > [root@c103008 ~]# systemctl start slurmd.service
>> > [root@c103008 ~]# systemctl status slurmd.service
>> > ● slurmd.service - Slurm node daemon
>> > Loaded: loaded (/etc/systemd/system/slurmd.service; enabled; vendor
>> > preset: disabled)
>> > Active: failed (Result: exit-code) since Mon 2022-01-31 00:22:42
>> EST;
>> > 2s ago
>> >Process: 11505 ExecStart=/usr/local/sbin/slurmd -D -s
>> $SLURMD_OPTIONS
>> > (code=exited, status=203/EXEC)
>> >   Main PID: 11505 (code=exited, status=203/EXEC)
>> >
>> > Jan 31 00:22:42 c103008 systemd[1]: Started Slurm node daemon.
>> > Jan 31 00:22:42 c103008 systemd[1]: slurmd.service: main process
>> exited,
>> > code=exited, status=203/EXEC
>> > Jan 31 00:22:42 c103008 systemd[1]: Unit slurmd.service entered failed
>> state.
>> > Jan 31 00:22:42 c103008 systemd[1]: slurmd.service failed.
>> >
>> >
>> > Best Regards,
>> > Nousheen Parvaiz
>> >
>> >
>> >
>> > On Mon, Jan 31, 2022 at 10:08 AM Nousheen > > > wrote:
>> >
>> > Dear Jeffrey,
>> >
>> > Thank you for your response. I have followed the steps as
>> instructed.
>> > After copying the files to their respective locations "systemctl
>> > status slurmctld.service" command gives me an error as follows:
>> >
>> > (base) [nousheen@exxact system]$ systemctl daemon-reload
>> > (base) [nousheen@exxact system]$ systemctl enable slurmctld.service
>> > (base) [nousheen@exxact system]$ systemctl start slurmctld.service
>> > (base) [nousheen@exxact system]$ systemctl status slurmctld.service
>> > ● slurmctld.service - Slurm controller daemon
>> > Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled;
>> > vendor preset: disabled)
>> > Active: failed (Result: exit-code) since Mon 2022-01-31 10:04:31
>> > PKT; 3s ago
>> >Process: 18114 ExecStart=/usr/local/sbin/slurmctld -D -s
>> > $SLURMCTLD_OPTIONS (code=exited, status=1/FAILURE)
>> >   Main PID: 18114 (code=exited, status=1/FAILURE)
>> >
>> > Jan 31 10:04:31 exxact systemd[1]: Started Slurm controller daemon.
>> > Jan 31 10:04:31 exxact systemd[1]: slurmctld.service: main process
>> > exited, code=exited, status=1/FAILURE
>> > Jan 31 10:04:31 exxact systemd[1]: Unit slurmctld.service entered
>> > failed state.
>> > Jan 31 10:04:31 exxact systemd[1]: slurmctld.service failed.
>> >
>> > Kindly guide me. Thank you so much for your time.
>> >
>> > Best Regards,
>> > Nousheen Parvaiz
>> >
>> >
>> > On Thu, Jan 27, 2022 at 8:25 

Re: [slurm-users] systemctl enable slurmd.service Failed to execute operation: No such file or directory

2022-01-31 Thread Nousheen
Dear Ole,

Thank you for your response.
I am doing it again using your suggested link.

Best Regards,
Nousheen Parvaiz



On Mon, Jan 31, 2022 at 2:07 PM Ole Holm Nielsen 
wrote:

> Hi Nousheen,
>
> I recommend you again to follow the steps for installing Slurm on a CentOS
> 7 cluster:
> https://wiki.fysik.dtu.dk/niflheim/Slurm_installation
>
> Maybe you will need to start installation from scratch, but the steps are
> guaranteed to work if followed correctly.
>
> IHTH,
> Ole
>
> On 1/31/22 06:23, Nousheen wrote:
> > The same error shows up on compute node which is as follows:
> >
> > [root@c103008 ~]# systemctl enable slurmd.service
> > [root@c103008 ~]# systemctl start slurmd.service
> > [root@c103008 ~]# systemctl status slurmd.service
> > ● slurmd.service - Slurm node daemon
> > Loaded: loaded (/etc/systemd/system/slurmd.service; enabled; vendor
> > preset: disabled)
> > Active: failed (Result: exit-code) since Mon 2022-01-31 00:22:42
> EST;
> > 2s ago
> >Process: 11505 ExecStart=/usr/local/sbin/slurmd -D -s $SLURMD_OPTIONS
> > (code=exited, status=203/EXEC)
> >   Main PID: 11505 (code=exited, status=203/EXEC)
> >
> > Jan 31 00:22:42 c103008 systemd[1]: Started Slurm node daemon.
> > Jan 31 00:22:42 c103008 systemd[1]: slurmd.service: main process exited,
> > code=exited, status=203/EXEC
> > Jan 31 00:22:42 c103008 systemd[1]: Unit slurmd.service entered failed
> state.
> > Jan 31 00:22:42 c103008 systemd[1]: slurmd.service failed.
> >
> >
> > Best Regards,
> > Nousheen Parvaiz
> >
> >
> >
> > On Mon, Jan 31, 2022 at 10:08 AM Nousheen  > > wrote:
> >
> > Dear Jeffrey,
> >
> > Thank you for your response. I have followed the steps as instructed.
> > After copying the files to their respective locations "systemctl
> > status slurmctld.service" command gives me an error as follows:
> >
> > (base) [nousheen@exxact system]$ systemctl daemon-reload
> > (base) [nousheen@exxact system]$ systemctl enable slurmctld.service
> > (base) [nousheen@exxact system]$ systemctl start slurmctld.service
> > (base) [nousheen@exxact system]$ systemctl status slurmctld.service
> > ● slurmctld.service - Slurm controller daemon
> > Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled;
> > vendor preset: disabled)
> > Active: failed (Result: exit-code) since Mon 2022-01-31 10:04:31
> > PKT; 3s ago
> >Process: 18114 ExecStart=/usr/local/sbin/slurmctld -D -s
> > $SLURMCTLD_OPTIONS (code=exited, status=1/FAILURE)
> >   Main PID: 18114 (code=exited, status=1/FAILURE)
> >
> > Jan 31 10:04:31 exxact systemd[1]: Started Slurm controller daemon.
> > Jan 31 10:04:31 exxact systemd[1]: slurmctld.service: main process
> > exited, code=exited, status=1/FAILURE
> > Jan 31 10:04:31 exxact systemd[1]: Unit slurmctld.service entered
> > failed state.
> > Jan 31 10:04:31 exxact systemd[1]: slurmctld.service failed.
> >
> > Kindly guide me. Thank you so much for your time.
> >
> > Best Regards,
> > Nousheen Parvaiz
> >
> >
> > On Thu, Jan 27, 2022 at 8:25 PM Jeffrey R. Lang  > > wrote:
> >
> > The missing file error has nothing to do with slurm.  The
> > systemctl command is part of the systems service management.
> >
> >
> > The error message indicates that you haven’t copied the
> > slurmd.service file on your compute node to /etc/systemd/system
> or
> > /usr/lib/systemd/system.  /etc/systemd/system is usually used
> when
> > a user adds a new service to a machine.
> >
> >
> > Depending on your version of Linux you may also need to do a
> > systemctl daemon-reload to activate the slurmd.service within
> > system.
> >
> >
> > Once slurmd.service is copied over, the systemctld command should
> > work just fine.
> >
> >
> > Remember:
> >
> >  slurmd.service -  Only on compute nodes
> >
> >  slurmctld.service – Only on your cluster
> > management node
> >
> >slurmdbd.service – Only on your cluster management
> > node
> >
> >
> > *From:* slurm-users  > > *On Behalf Of
> > *Nousheen
> > *Sent:* Thursday, January 27, 2022 3:54 AM
> > *To:* Slurm User Community List  > >
> > *Subject:* [slurm-users] systemctl enable slurmd.service Failed
> to
> > execute operation: No such file or directory
> >

[slurm-users] Fwd: systemctl enable slurmd.service Failed to execute operation: No such file or directory

2022-01-31 Thread Nousheen
Best Regards,
Nousheen Parvaiz
Ph.D. Scholar
National Center For Bioinformatics
Quaid-i-Azam University, Islamabad


Dear Hermann,

Thank you for your reply. I have given below my slurm.conf and log file.


# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ClusterName=Nousheen1
SlurmctldHost=192.168.60.194
#
#MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/affinity
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
#
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
#SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurmctld.log
#SlurmdDebug=info
SlurmdLogFile=/var/log/slurmd.log
#
#
# COMPUTE NODES

NodeName=Nousheen1 NodeAddr=192.168.60.194 CPUs=1 State=UNKNOWN
NodeName=Nousheen2 NodeAddr=192.168.60.104 CPUs=1 State=UNKNOWN

PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP


Slurmctld.log:
[2022-01-31T11:42:26.293] error: chdir(/var/log): Permission denied
[2022-01-31T11:42:26.294] slurmctld version 21.08.5 started on cluster
cluster194
[2022-01-31T11:42:26.294] error: Couldn't find the specified plugin name
for cred/munge looking at all files
[2022-01-31T11:42:26.295] error: cannot find cred plugin for cred/munge
[2022-01-31T11:42:26.295] error: cannot create cred context for cred/munge
[2022-01-31T11:42:26.295] fatal: slurm_cred_creator_ctx_create((null)):
Operation not permitted


(base) [nousheen@exxact Documents]$ usr/local/sbin/slurmctld -D
bash: usr/local/sbin/slurmctld: No such file or directory
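
For the record, two things stand out above: the fatal error is still the
missing cred/munge plugin (which usually means Slurm was built without the
munge development package), and the last command lacks its leading slash. The
chdir error suggests the slurm user cannot enter /var/log on this machine, so
either fix those permissions or point SlurmctldLogFile at a directory owned by
slurm, e.g. (as root, assuming SlurmUser=slurm as in the slurm.conf above):

ls -ld /var/log                               # the slurm user needs execute permission here
touch /var/log/slurmctld.log && chown slurm:slurm /var/log/slurmctld.log
/usr/local/sbin/slurmctld -D                  # note the leading /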

Best Regards,
Nousheen Parvaiz


[slurm-users] Stopping new jobs but letting old ones end

2022-01-31 Thread Sid Young
G'Day all,

I need to replace a faulty DIMM chip in our login node so I need to stop
new jobs being kicked off while letting the old ones end.

I thought I would just set all nodes to drain to stop new jobs from being
kicked off... does this sound like a good idea? Down time window would be
20-30 minutes, scheduler is a separate node and I could email back any
users who try to SSH while the node is down.


Sid Young
W: https://off-grid-engineering.com
W: (personal) https://sidyoung.com/
W: (personal) https://z900collector.wordpress.com/


Re: [slurm-users] How to tell SLURM to ignore specific GPUs

2022-01-31 Thread Stephan Roth


Not a solution, but some ideas & experiences concerning the same topic:

A few of our older GPUs used to show the error message "has fallen off 
the bus" which was only resolved by a full power cycle as well.


Something changed; nowadays the error message is "GPU lost" and a
normal reboot resolves the problem. This might be a result of an update 
of the Nvidia drivers (currently 60.73.01), but I can't be sure.


The current behaviour allowed us to write a script checking GPU state 
every 10 minutes and setting a node to drain state when such a 
"lost GPU" is detected.

This has been working well for a couple of months now and saves us time.
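
Not the actual script, just a minimal sketch of the idea (it assumes the Slurm
node name matches the short hostname and that scontrol may be run on the node):

#!/bin/bash
# run every 10 minutes from cron; drain this node if a GPU has been lost
if nvidia-smi 2>&1 | grep -qiE 'GPU is lost|fallen off the bus|Unable to determine the device handle'; then
    scontrol update NodeName="$(hostname -s)" State=DRAIN Reason="GPU lost"
fi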

It might help as well to re-seat all GPUs and PCI risers; this also
seemed to help in one of our GPU nodes. Again, I can't be sure, we'd
need to try this with other - still failing - GPUs.


The problem is to identify the cards physically from the information we 
have, like what's reported with nvidia-smi or available in 
/proc/driver/nvidia/gpus/*/information.
The serial number isn't shown for every type of GPU and I'm not sure the 
ones shown match the stickers on the GPUs.
If anybody were to know of a practical solution for this, I'd be happy 
to read it.


Eventually I'd like to pull out all cards which repeatedly get "lost" 
and maybe move them all to a node for short debug jobs or throw them 
away (they're all beyond warranty anyway).


Stephan


On 31.01.22 15:45, Timony, Mick wrote:

I have a large compute node with 10 RTX8000 cards at a remote colo.
One of the cards on it is acting up "falling off the bus" once a day
requiring a full power cycle to reset.

I want jobs to avoid that card as well as the card it is NVLINK'ed to.

So I modified gres.conf on that node as follows:


# cat /etc/slurm/gres.conf
AutoDetect=nvml
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia0
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia1
#Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia2
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia3
#Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia4
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia5
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia6
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia7
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia8
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia9

and in slurm.conf I changed the node def from Gres=gpu:quadro_rtx_8000:10
to Gres=gpu:quadro_rtx_8000:8.  I restarted slurmctld and slurmd
after this.

I then put the node back from drain to idle.  Jobs were submitted and
started on the node but they are using the GPU I told it to avoid

++
| Processes: |
|  GPU   GI   CI    PID   Type   Process name GPU Memory |
|    ID   ID  Usage  |
||
|    0   N/A  N/A 63426  C   python 11293MiB |
|    1   N/A  N/A 63425  C   python 11293MiB |
|    2   N/A  N/A 63425  C   python 10869MiB |
|    2   N/A  N/A 63426  C   python 10869MiB |
|    4   N/A  N/A 63425  C   python 10849MiB |
|    4   N/A  N/A 63426  C   python 10849MiB |
++

How can I make SLURM not use GPU 2 and 4?

---
Paul Raines http://help.nmr.mgh.harvard.edu

MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging
149 (2301) 13th Street Charlestown, MA 02129    USA


You can use the nvidia-smi command to 'drain' the GPUs, which will
power down the GPUs so that no applications will use them.


This thread on stack overflow explains how to do that:

https://unix.stackexchange.com/a/654089/94412 



You can create a script to run at boot and 'drain' the cards.

Regards
--Mick


smime.p7s
Description: S/MIME Cryptographic Signature


Re: [slurm-users] Secondary Unix group id of users not being issued in interactive srun command

2022-01-31 Thread Timo Rothenpieler

Make sure you properly configured nsswitch.conf.
Most commonly this kind of issue indicates that you forgot to define 
initgroups correctly.


It should look something like this:

...
group:  files [SUCCESS=merge] systemd [SUCCESS=merge] ldap
...
initgroups: files [SUCCESS=continue] ldap
...
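
After changing nsswitch.conf you can check what the NSS stack now returns for a
user with something like (glibc's getent knows an initgroups database):

getent initgroups johndoe
id johndoe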


On 28.01.2022 06:56, Ratnasamy, Fritz wrote:

Hi,

I have a similar issue as described on the following link 
(https://groups.google.com/g/slurm-users/c/6SnwFV-S_Nk) 

A machine had some existing local permissions. We have added it as a 
compute node  to our cluster via Slurm. When running an srun interactive 
session on that server,

it would seem that the LDAP groups shadow the local groups.
johndoe@ecolonnelli:~ $ groups
Faculty_Collab ecolonnelli_access #Those are LDAP groups
johndoe@ecolonnelli:~ $ groups johndoe
johndoe : Faculty_Collab projectsbrasil core rais rfb polconnfirms 
johndoe vpce rfb_all backup_johndoe ecolonnelli_access


The issue is that now the user cannot access folders that have
his local group permissions (such as projectsbrasil, rais, rfb, core,
etc.) when he requests an interactive session on that compute node.

Is there any solution to that issue?
Best,





Re: [slurm-users] Secondary Unix group id of users not being issued in interactive srun command

2022-01-31 Thread Russell Jones
I solved this issue by adding a group to IPA that matched the same name and
GID of the local groups, then using [SUCCESS=merge] in nsswitch.conf for
groups, and on our CentOS 8 nodes adding "enable_files_domain = False" in
the sssd.conf file.

On Fri, Jan 28, 2022 at 5:02 PM Ratnasamy, Fritz <
fritz.ratnas...@chicagobooth.edu> wrote:

> Hi Mitchell, Remi
>
> This is what returned the command:  find /sys/fs/cgroup -name "*71953*"
> /sys/fs/cgroup/freezer/slurm/uid_71953
> /sys/fs/cgroup/devices/slurm/uid_71953
> /sys/fs/cgroup/cpuset/slurm/uid_71953
> /sys/fs/cgroup/cpu,cpuacct/slurm/uid_71953
> /sys/fs/cgroup/memory/slurm/uid_71953
>
> Do you have any idea what could cause the issue?
> Thanks,
>
> *Fritz Ratnasamy*
>
> Data Scientist
>
> Information Technology
>
> The University of Chicago
>
> Booth School of Business
>
> 5807 S. Woodlawn
>
> Chicago, Illinois 60637
>
> Phone: +(1) 773-834-4556
>
>
> On Fri, Jan 28, 2022 at 12:01 PM Walls, Mitchell  wrote:
>
>> Do you see the uid in /sys/fs/cgroup? (i.e. find /sys/fs/cgroup -name
>> "*71953*"). If not that could point to cgroup config.
>>
>> 
>> From: slurm-users  on behalf of
>> Ratnasamy, Fritz 
>> Sent: Friday, January 28, 2022 11:13 AM
>> To: Rémi Palancher; Slurm User Community List; James Millsap
>> Subject: Re: [slurm-users] Secondary Unix group id of users not being
>> issued in interactive srun command
>>
>> Hi Remi,
>>
>>  Yes it does return the same id. See below:
>> johndoe@ecolonnelli:~ $ id
>> uid=71953(johndoe) gid=100026(Faculty_Collab)
>> groups=100026(Faculty_Collab),100181(ecolonnelli_access)
>> johndoe@ecolonnelli:~ $ id johndoe
>> uid=71953(johndoe) gid=100026(Faculty_Collab)
>> groups=100026(Faculty_Collab),1000(projectsbrasil),1003(core),1549(rais),1550(rfb),1552(polconnfirms),1558(vpce),1559(rfb_all),1563(johndoe),100181(ecolonnelli_access)
>>
>> Fritz Ratnasamy
>> Data Scientist
>> Information Technology
>> The University of Chicago
>> Booth School of Business
>> 5807 S. Woodlawn
>> Chicago, Illinois 60637
>> Phone: +(1) 773-834-4556
>>
>>
>> On Fri, Jan 28, 2022 at 2:04 AM Rémi Palancher > r...@rackslab.io>> wrote:
>> Le vendredi 28 janvier 2022 à 06:56, Ratnasamy, Fritz <
>> fritz.ratnas...@chicagobooth.edu>
>> a écrit :
>>
>> > Hi,
>> >
>> > I have a similar issue as described on the following link (
>> https://groups.google.com/g/slurm-users/c/6SnwFV-S_Nk)A machine had some
>> existing local permissions. We have added it as a compute node to our
>> cluster via Slurm. When running an srun interactive session on that
>> server,it would seem that the LDAP groups shadow the local groups.
>> >
>> > johndoe@ecolonnelli:~ $ groups
>> >
>> > Faculty_Collab ecolonnelli_access #Those are LDAP groups
>> >
>> > johndoe@ecolonnelli:~ $ groups johndoe
>> >
>> > johndoe : Faculty_Collab projectsbrasil core rais rfb polconnfirms
>> johndoe vpce rfb_all backup_johndoe ecolonnelli_access
>>
>> The difference between the first and the second command could be the UID
>> used for the resolution. The first command calls getgroups() syscall using
>> the UID of the shell. The second command resolves johndoe UID through
>> nsswitch stack then looks after the groups of this UID.
>>
>> Do you have johndoe declared in both local /etc/passwd and LDAP directory
>> with different UID?
>>
>> Do `id` and `id johndoe` return the same UID?
>>
>> --
>> Rémi Palancher
>> Rackslab: Open Source Solutions for HPC Operations
>> https://rackslab.io


Re: [slurm-users] addressing NVIDIA MIG + non MIG devices in Slurm

2022-01-31 Thread Bas van der Vlies

This is not an answer to the MIG issue but to the question that Esben
has. We at SURF have developed sharing of all the GPUs in a node; we
"misuse" the Slurm MPS feature. At SURF this is mostly used for GPU courses,
e.g. JupyterHub.


We have tested it with Slurm version 20.11.8. The code is publicly
available at:

 * https://github.com/basvandervlies/surf_slurm_mps

On 27/01/2022 17:00, EPF (Esben Peter Friis) wrote:

Hi Mathias

I can't answer your specific question, so this is more of a comment 

We have a system with 8 x Nvidia A40, where we would like to share each 
GPU between several jobs (they have 48GB each), eg starting 32 jobs, 
with 4 on each GPU. I looked into MIG as well, but unfortunately that is 
not supported by the A40 hardware (only A30 and A100).
I have tried MPS, but strangely that works only for the first GPU on 
each node, so only one of the 8 GPUs in our system can be shared in this 
way. That is, it used to be like that. A couple of weeks ago, an 
"all_sharing" flag was introduced for gres.conf, which apparently should 
make it possible to share all the GPUs with MPS. I haven't tried it yet, 
but it may be worth a try. It should be possible to configure some GPUs 
as mps and some as gpu resources.


Cheers,

Esben


https://github.com/SchedMD/slurm/blob/master/doc/man/man5/gres.conf.5 



all_sharing
	To be used on a shared gres. This is the opposite of one_sharing and 
can be
	used to allow all sharing gres (gpu) on a node to be used for shared 
gres (mps).


	NOTE: If a gres has this flag configured it is global, so all other 
nodes with

that gres will have this flag implied.  This flag is not compatible with
one_sharing for a specific gres.





*From:* slurm-users  on behalf of 
Matthias Leopold 

*Sent:* Thursday, January 27, 2022 16:27
*To:* Slurm User Community List 
*Subject:* [slurm-users] addressing NVIDIA MIG + non MIG devices in Slurm
Hi,

we have 2 DGX A100 systems which we would like to use with Slurm. We
want to use the MIG feature for _some_ of the GPUs. As I somehow
suspected I couldn't find a working setup for this in Slurm yet. I'll
describe the configuration variants I tried after creating the MIG
instances, it might be a longer read, please bear with me.

1. using slurm-mig-discovery for gres.conf
(https://gitlab.com/nvidia/hpc/slurm-mig-discovery)

- CUDA_VISIBLE_DEVICES: list of indices
-> seems to bring a working setup and full flexibility at first, but
when taking a closer look the selection of GPU devices is completely
unpredictable (output of nvidia-smi inside Slurm job)

2. using "AutoDetect=nvml" in gres.conf (Slurm docs)
- CUDA_VISIBLE_DEVICES: MIG format (see
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars)



2.1 converting ALL GPUs to MIG
- also a full A100 is converted to a 7g.40gb MIG instance
- gres.conf: "AutoDetect=nvml" only
- slurm.conf Node Def: naming all MIG types (read from slurmd debug log)
-> working setup
-> problem: IPC (MPI) between MIG instances not possible, this seems to
be a by-design limitation

2.2 converting SOME GPUs to MIG
- some A100 are NOT in MIG mode

2.2.1 using "AutoDetect=nvml" only (Variant 1)
- slurm.conf Node Def: Gres with and without type
-> problem: fatal: _foreach_slurm_conf: 

Re: [slurm-users] How to tell SLURM to ignore specific GPUs

2022-01-31 Thread Timony, Mick
I have a large compute node with 10 RTX8000 cards at a remote colo.
One of the cards on it is acting up "falling off the bus" once a day
requiring a full power cycle to reset.

I want jobs to avoid that card as well as the card it is NVLINK'ed to.

So I modified gres.conf on that node as follows:


# cat /etc/slurm/gres.conf
AutoDetect=nvml
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia0
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia1
#Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia2
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia3
#Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia4
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia5
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia6
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia7
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia8
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia9

and in slurm.conf I changed the node def from Gres=gpu:quadro_rtx_8000:10
to Gres=gpu:quadro_rtx_8000:8.  I restarted slurmctld and slurmd
after this.

I then put the node back from drain to idle.  Jobs were submitted and
started on the node but they are using the GPU I told it to avoid

++
| Processes: |
|  GPU   GI   CIPID   Type   Process name GPU Memory |
|ID   ID  Usage  |
||
|0   N/A  N/A 63426  C   python 11293MiB |
|1   N/A  N/A 63425  C   python 11293MiB |
|2   N/A  N/A 63425  C   python 10869MiB |
|2   N/A  N/A 63426  C   python 10869MiB |
|4   N/A  N/A 63425  C   python 10849MiB |
|4   N/A  N/A 63426  C   python 10849MiB |
++

How can I make SLURM not use GPU 2 and 4?

---
Paul Raines http://help.nmr.mgh.harvard.edu
MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging
149 (2301) 13th Street Charlestown, MA 02129USA


You can use the nvidia-smi command to 'drain' the GPUs, which will power down
the GPUs so that no applications will use them.

This thread on stack overflow explains how to do that:

https://unix.stackexchange.com/a/654089/94412

You can create a script to run at boot and 'drain' the cards.
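
Roughly like this, following the answer linked above (the PCI bus IDs below are
placeholders - look up the real ones for GPUs 2 and 4 first):

nvidia-smi --query-gpu=index,pci.bus_id --format=csv
sudo nvidia-smi drain -p 0000:3B:00.0 -m 1    # the faulty GPU
sudo nvidia-smi drain -p 0000:86:00.0 -m 1    # and its NVLink peer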

Regards
--Mick




Re: [slurm-users] addressing NVIDIA MIG + non MIG devices in Slurm - solved

2022-01-31 Thread Matthias Leopold

I looked at option
> 2.2.3 using partial "AutoDetect=nvml"
again and saw that the reason for failure was indeed the sanity check, 
but it was my fault because I set an invalid "Links" value for the 
"hardcoded" GPUs. So this variant of gres.conf setup works and gives me 
everything I want, sorry for bothering you.
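
For anyone finding this in the archive: the shape of such a partial gres.conf
is roughly the following (device files, type name and Links values are purely
illustrative, not the actual configuration):

AutoDetect=nvml
Name=gpu Type=a100 File=/dev/nvidia6 Links=0,0,0,0,0,0,-1,0
Name=gpu Type=a100 File=/dev/nvidia7 Links=0,0,0,0,0,0,0,-1

i.e. NVML autodetection for the MIG devices plus hardcoded entries for the
non-MIG A100s, each with a per-GPU Links list that has -1 at the device's own
index.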


Matthias

Am 27.01.22 um 16:27 schrieb Matthias Leopold:

Hi,

we have 2 DGX A100 systems which we would like to use with Slurm. We 
want to use the MIG feature for _some_ of the GPUs. As I somehow 
suspected I couldn't find a working setup for this in Slurm yet. I'll 
describe the configuration variants I tried after creating the MIG 
instances, it might be a longer read, please bear with me.


1. using slurm-mig-discovery for gres.conf 
(https://gitlab.com/nvidia/hpc/slurm-mig-discovery)

- CUDA_VISIBLE_DEVICES: list of indices
-> seems to bring a working setup and full flexibility at first, but 
when taking a closer look the selection of GPU devices is completely 
unpredictable (output of nvidia-smi inside Slurm job)


2. using "AutoDetect=nvml" in gres.conf (Slurm docs)
- CUDA_VISIBLE_DEVICES: MIG format (see 
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars)


2.1 converting ALL GPUs to MIG
- also a full A100 is converted to a 7g.40gb MIG instance
- gres.conf: "AutoDetect=nvml" only
- slurm.conf Node Def: naming all MIG types (read from slurmd debug log)
-> working setup
-> problem: IPC (MPI) between MIG instances not possible, this seems to 
be a by-design limitation


2.2 converting SOME GPUs to MIG
- some A100 are NOT in MIG mode

2.2.1 using "AutoDetect=nvml" only (Variant 1)
- slurm.conf Node Def: Gres with and without type
-> problem: fatal: _foreach_slurm_conf: Some gpu GRES in slurm.conf have 
a type while others do not (slurm_gres->gres_cnt_config (26) > tmp_count 
(21))


2.2.2 using "AutoDetect=nvml" only (Variant 2)
- slurm.conf Node Def: only Gres without type (sum of MIG + non MIG)
-> problem: different GPU types can't be requested

2.2.3 using partial "AutoDetect=nvml"
- gres.conf: "AutoDetect=nvml" + hardcoding of non MIG GPUs
- slurm.conf Node Def: MIG + non MIG Gres types
-> produces a "perfect" config according to slurmd debug log
-> problem: the sanity-check mode of "AutoDetect=nvml" prevents 
operation (?)
-> Reason=gres/gpu:1g.5gb count too low (0 < 21) 
[slurm@2022-01-27T11:23:59]


2.2.4 using static gres.conf with NVML generated config
- using a gres.conf with NVML generated config where I can define the 
type for non MIG GPU and also set the UniqueId for MIG instances would 
be the perfect solution

- slurm.conf Node Def: MIG + non MIG Gres types
-> problem: it doesn't work
-> Parsing error at unrecognized key: UniqueId

Thanks for reading this far. Am I missing something? How can MIG and non 
MIG devices be addressed in a cluster? This setup of having MIG and non 
MIG devices can't be exotic, since having ONLY MIG devices has severe 
disadvantages (see 2.1). Thanks again for any advice.


Matthias



--
Matthias Leopold
IT Systems & Communications
Medizinische Universität Wien
Spitalgasse 23 / BT 88 / Ebene 00
A-1090 Wien
Tel: +43 1 40160-21241
Fax: +43 1 40160-921200



Re: [slurm-users] systemctl enable slurmd.service Failed to execute operation: No such file or directory

2022-01-31 Thread Ole Holm Nielsen

Hi Nousheen,

I recommend you again to follow the steps for installing Slurm on a CentOS 
7 cluster:

https://wiki.fysik.dtu.dk/niflheim/Slurm_installation

Maybe you will need to start installation from scratch, but the steps are 
guaranteed to work if followed correctly.


IHTH,
Ole

On 1/31/22 06:23, Nousheen wrote:

The same error shows up on compute node which is as follows:

[root@c103008 ~]# systemctl enable slurmd.service
[root@c103008 ~]# systemctl start slurmd.service
[root@c103008 ~]# systemctl status slurmd.service
● slurmd.service - Slurm node daemon
    Loaded: loaded (/etc/systemd/system/slurmd.service; enabled; vendor 
preset: disabled)
    Active: failed (Result: exit-code) since Mon 2022-01-31 00:22:42 EST; 
2s ago
   Process: 11505 ExecStart=/usr/local/sbin/slurmd -D -s $SLURMD_OPTIONS 
(code=exited, status=203/EXEC)

  Main PID: 11505 (code=exited, status=203/EXEC)

Jan 31 00:22:42 c103008 systemd[1]: Started Slurm node daemon.
Jan 31 00:22:42 c103008 systemd[1]: slurmd.service: main process exited, 
code=exited, status=203/EXEC

Jan 31 00:22:42 c103008 systemd[1]: Unit slurmd.service entered failed state.
Jan 31 00:22:42 c103008 systemd[1]: slurmd.service failed.


Best Regards,
Nousheen Parvaiz



On Mon, Jan 31, 2022 at 10:08 AM Nousheen > wrote:


Dear Jeffrey,

Thank you for your response. I have followed the steps as instructed.
After copying the files to their respective locations "systemctl
status slurmctld.service" command gives me an error as follows:

(base) [nousheen@exxact system]$ systemctl daemon-reload
(base) [nousheen@exxact system]$ systemctl enable slurmctld.service
(base) [nousheen@exxact system]$ systemctl start slurmctld.service
(base) [nousheen@exxact system]$ systemctl status slurmctld.service
● slurmctld.service - Slurm controller daemon
    Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled;
vendor preset: disabled)
    Active: failed (Result: exit-code) since Mon 2022-01-31 10:04:31
PKT; 3s ago
   Process: 18114 ExecStart=/usr/local/sbin/slurmctld -D -s
$SLURMCTLD_OPTIONS (code=exited, status=1/FAILURE)
  Main PID: 18114 (code=exited, status=1/FAILURE)

Jan 31 10:04:31 exxact systemd[1]: Started Slurm controller daemon.
Jan 31 10:04:31 exxact systemd[1]: slurmctld.service: main process
exited, code=exited, status=1/FAILURE
Jan 31 10:04:31 exxact systemd[1]: Unit slurmctld.service entered
failed state.
Jan 31 10:04:31 exxact systemd[1]: slurmctld.service failed.

Kindly guide me. Thank you so much for your time.

Best Regards,
Nousheen Parvaiz


On Thu, Jan 27, 2022 at 8:25 PM Jeffrey R. Lang mailto:jrl...@uwyo.edu>> wrote:

The missing file error has nothing to do with slurm.  The
systemctl command is part of the systems service management.


The error message indicates that you haven’t copied the
slurmd.service file on your compute node to /etc/systemd/system or
/usr/lib/systemd/system.  /etc/systemd/system is usually used when
a user adds a new service to a machine.


Depending on your version of Linux you may also need to do a
systemctl daemon-reload to activate the slurmd.service within
system.


Once slurmd.service is copied over, the systemctld command should
work just fine.


Remember:

     slurmd.service -  Only on compute nodes

     slurmctld.service – Only on your cluster
management node

       slurmdbd.service – Only on your cluster management
node


*From:* slurm-users mailto:slurm-users-boun...@lists.schedmd.com>> *On Behalf Of
*Nousheen
*Sent:* Thursday, January 27, 2022 3:54 AM
*To:* Slurm User Community List mailto:slurm-users@lists.schedmd.com>>
*Subject:* [slurm-users] systemctl enable slurmd.service Failed to
execute operation: No such file or directory


Hello everyone,


I am installing slurm on Centos 7 following tutorial:
https://www.slothparadise.com/how-to-install-slurm-on-centos-7-cluster/




I am at the step where we start slurm but it gives me the
following error:


[root@exxact slurm-21.08.5]# systemctl enable slurmd.service

Failed to execute operation: No such file or directory


I have run the command to check if slurm 

Re: [slurm-users] systemctl enable slurmd.service Failed to execute operation: No such file or directory

2022-01-31 Thread Hermann Schwärzler

Dear Nousheen,

I guess there is something missing in your installation - probably your
slurm.conf?


Do you have logging enabled for slurmctld? If yes what do you see in 
that log?

Or what do you get if you run slurmctld manually like this:

/usr/local/sbin/slurmctld -D
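
or with more verbosity:

/usr/local/sbin/slurmctld -D -vvv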

Regards,
Hermann

On 1/31/22 6:08 AM, Nousheen wrote:

Dear Jeffrey,

Thank you for your response. I have followed the steps as instructed. 
After copying the files to their respective locations "systemctl
status slurmctld.service" command gives me an error as follows:


(base) [nousheen@exxact system]$ systemctl daemon-reload
(base) [nousheen@exxact system]$ systemctl enable slurmctld.service
(base) [nousheen@exxact system]$ systemctl start slurmctld.service
(base) [nousheen@exxact system]$ systemctl status slurmctld.service
● slurmctld.service - Slurm controller daemon
    Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled; 
vendor preset: disabled)
    Active: failed (Result: exit-code) since Mon 2022-01-31 10:04:31 
PKT; 3s ago
   Process: 18114 ExecStart=/usr/local/sbin/slurmctld -D -s 
$SLURMCTLD_OPTIONS (code=exited, status=1/FAILURE)

  Main PID: 18114 (code=exited, status=1/FAILURE)

Jan 31 10:04:31 exxact systemd[1]: Started Slurm controller daemon.
Jan 31 10:04:31 exxact systemd[1]: slurmctld.service: main process 
exited, code=exited, status=1/FAILURE
Jan 31 10:04:31 exxact systemd[1]: Unit slurmctld.service entered failed 
state.

Jan 31 10:04:31 exxact systemd[1]: slurmctld.service failed.

Kindly guide me. Thank you so much for your time.

Best Regards,
Nousheen Parvaiz


On Thu, Jan 27, 2022 at 8:25 PM Jeffrey R. Lang > wrote:


The missing file error has nothing to do with slurm.  The systemctl
command is part of the systems service management.


The error message indicates that you haven’t copied the
slurmd.service file on your compute node to /etc/systemd/system or
/usr/lib/systemd/system.  /etc/systemd/system is usually used when a
user adds a new service to a machine.


Depending on your version of Linux you may also need to do a
systemctl daemon-reload to activate the slurmd.service within
system.


Once slurmd.service is copied over, the systemctld command should
work just fine.


Remember:

     slurmd.service -  Only on compute nodes

     slurmctld.service – Only on your cluster management
node

       slurmdbd.service – Only on your cluster management
node


*From:* slurm-users mailto:slurm-users-boun...@lists.schedmd.com>> *On Behalf Of *Nousheen
*Sent:* Thursday, January 27, 2022 3:54 AM
*To:* Slurm User Community List mailto:slurm-users@lists.schedmd.com>>
*Subject:* [slurm-users] systemctl enable slurmd.service Failed to
execute operation: No such file or directory


Hello everyone,


I am installing slurm on Centos 7 following tutorial:
https://www.slothparadise.com/how-to-install-slurm-on-centos-7-cluster/




I am at the step where we start slurm but it gives me the following
error:


[root@exxact slurm-21.08.5]# systemctl enable slurmd.service

Failed to execute operation: No such file or directory


I have run the command to check if slurm is configured properly


[root@exxact slurm-21.08.5]# slurmd -C
NodeName=exxact CPUs=12 Boards=1 SocketsPerBoard=1 CoresPerSocket=6
ThreadsPerCore=2 RealMemory=31889
UpTime=19-16:06:00


I am new to this and unable to understand the problem. Kindly help
me resolve this.


My slurm.conf file is as follows:


# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ClusterName=cluster194
SlurmctldHost=192.168.60.194
#SlurmctldHost=
#
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=67043328
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=lua
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=1
#MaxStepCount=4
#MaxTasksPerNode=512
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/cgroup
#Prolog=
#PrologFlags=
#PrologSlurmctld=