Re: [slurm-users] Add partition to existing user association

2022-01-24 Thread Thekla Loizou

Hi all,

I agree with both Marcus and Loris.

I am basically referring to modifying the association: when we first 
created our associations we had only "user" and "account", and now we 
also need to add the "partition".


My understanding from the documentation was that I would be able to 
modify the association and add the partition but it seems that this is 
not the case.


So I guess we will proceed with my original solution: delete all 
associations consisting of "user" and "account" and create new ones 
consisting of "user", "account" and "partition".
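For each user and account, that boils down to the two commands from my 
original mail below (with our example names, repeated once per partition 
the user should be able to use):

sacctmgr del user thekla account=ops

sacctmgr add user thekla account=ops partition=gpu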


Regards,

Thekla

On 24/1/22 3:54 PM, Marcus Wagner wrote:

Hi all,

An association is a triple (a quadruple if you have several clusters) 
consisting of "user", "account" and "partition".

So, you need to add an association.

I'm not sure how the accounting works if no partition is set. We always 
set that triple automatically, on the first submission that matches it.


Not all users / accounts are allowed to use all partitions. This is 
checked externally, and if a user is allowed to submit to a partition 
with a specific account, we add that triple with sacctmgr.
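For example, roughly like this (the user, account and partition names 
are just placeholders):

sacctmgr add user <user> account=<account> partition=<partition>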



Best
Marcus

On 24.01.2022 at 14:38, Loris Bennett wrote:

Dear Thekla,

Disclaimer: Firstly, I find account management in Slurm confusing and
the documentation strangely unenlightening.  Secondly, I don't make many
changes to things once users have been set up, so I have very little
experience of actually tweaking the accounting.

Despite my understanding of the documentation that you *can* modify the
partition of a user, I don't think this is actually the case. If I look
at the database, the user table has no column 'partition' whereas the
association table does.

So you might be able to modify the association, but you might also just
have to delete the association and recreate it with the desired
partitions.  Or you might have to do something entirely different ...

Maybe people who do understand Slurm's account management can chip in.

Cheers,

Loris

Thekla Loizou  writes:


Dear Dori,

Thanks for your reply. Unfortunately this does not work either...

Best,

Thekla

On 21/1/22 7:43 PM, Dori Sajdak wrote:

Hi Thekla,

When it comes to partitions, I believe you need to specify the 
cluster so in your example:


sacctmgr modify user thekla account=ops set partition=gpu where 
cluster=YourClusterName


QOS is not tied to a specific cluster but partitions are. That 
should work for you.


Dori

***
Dori Sajdak (she/her/hers)
Senior Systems Administrator
Center for Computational Research
University at Buffalo, State University of New York
701 Ellicott St
Buffalo, New York 14203
Phone: (716) 881-8934
Fax: (716) 849-6656
Web: http://buffalo.edu/ccr
Help Desk:  https://ubccr.freshdesk.com
Twitter:  https://twitter.com/ubccr
***



-Original Message-
From: slurm-users  On Behalf 
Of Thekla Loizou

Sent: Friday, January 21, 2022 9:12 AM
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] Add partition to existing user association

Dear all,

I was wondering if there is a way to add a partition to an existing 
user association.


For example, if I have an association of user thekla with the account 
ops, I can set a QOS for the existing association:


sacctmgr modify user thekla account=ops set qos=nosubmit
    Modified user associations...
     C = cyclamen   A = ops  U = thekla

However, I cannot set a partition:

sacctmgr modify user thekla account=ops set partition=gpu
    Unknown option: partition=gpu
    Use keyword 'where' to modify condition

Is this not possible?

The only solution I have found is to delete the association and 
create it again with the partition:


sacctmgr del user thekla account=ops

sacctmgr add user thekla account=ops partition=gpu

Thank you,

Thekla










Re: [slurm-users] Add partition to existing user association

2022-01-24 Thread Thekla Loizou

Dear Dori,

Thanks for your reply. Unfortunately this does not work either...

Best,

Thekla

On 21/1/22 7:43 PM, Dori Sajdak wrote:

Hi Thekla,

When it comes to partitions, I believe you need to specify the cluster so in 
your example:

sacctmgr modify user thekla account=ops set partition=gpu where 
cluster=YourClusterName

QOS is not tied to a specific cluster but partitions are.  That should work for 
you.

Dori

***
Dori Sajdak (she/her/hers)
Senior Systems Administrator
Center for Computational Research
University at Buffalo, State University of New York
701 Ellicott St
Buffalo, New York 14203
Phone: (716) 881-8934
Fax: (716) 849-6656
Web: http://buffalo.edu/ccr
Help Desk:  https://ubccr.freshdesk.com
Twitter:  https://twitter.com/ubccr
***



-Original Message-
From: slurm-users  On Behalf Of Thekla 
Loizou
Sent: Friday, January 21, 2022 9:12 AM
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] Add partition to existing user association

Dear all,

I was wondering if there is a way to add a partition to an existing user 
association.

For example, if I have an association of user thekla with the account ops, I 
can set a QOS for the existing association:

sacctmgr modify user thekla account=ops set qos=nosubmit
   Modified user associations...
    C = cyclamen   A = ops  U = thekla

However, I cannot set a partition:

sacctmgr modify user thekla account=ops set partition=gpu
   Unknown option: partition=gpu
   Use keyword 'where' to modify condition

Is this not possible?

The only solution I have found is to delete the association and create it 
again with the partition:

sacctmgr del user thekla account=ops

sacctmgr add user thekla account=ops partition=gpu

Thank you,

Thekla






[slurm-users] Add partition to existing user association

2022-01-21 Thread Thekla Loizou

Dear all,

I was wondering if there is a way to add a partition to an existing user 
association.


For example, if I have an association of user thekla with the account ops, I 
can set a QOS for the existing association:


sacctmgr modify user thekla account=ops set qos=nosubmit
 Modified user associations...
  C = cyclamen   A = ops  U = thekla

However, I cannot set a partition:

sacctmgr modify user thekla account=ops set partition=gpu
 Unknown option: partition=gpu
 Use keyword 'where' to modify condition

Is this not possible?

The only solution I have found is to delete the association and 
create it again with the partition:


sacctmgr del user thekla account=ops

sacctmgr add user thekla account=ops partition=gpu

Thank you,

Thekla




Re: [slurm-users] Job array start time and SchedNodes

2021-12-09 Thread Thekla Loizou

Dear Loris,

Yes, it is indeed a bit odd. At least now I know that this is how SLURM 
behaves and not something to do with our configuration.


Regards,

Thekla

On 9/12/21 1:04 PM, Loris Bennett wrote:

Dear Thekla,

Yes, I think you are right.  I have found a similar job on my system and
this does seem to be the normal, slightly confusing behaviour.  It looks
as if the pending elements of the array get assigned a single node,
but then start on other nodes:

   $ squeue -j 8536946 -O jobid,jobarrayid,reason,schednodes,nodelist,state | head
   JOBID     JOBID               REASON      SCHEDNODES   NODELIST   STATE
   8536946   8536946_[401-899]   Resources   g002                    PENDING
   8658719   8536946_400         None        (null)       g006       RUNNING
   8658685   8536946_399         None        (null)       g012       RUNNING
   8658625   8536946_398         None        (null)       g001       RUNNING
   8658491   8536946_397         None        (null)       g006       RUNNING
   8658428   8536946_396         None        (null)       g003       RUNNING
   8658427   8536946_395         None        (null)       g003       RUNNING
   8658426   8536946_394         None        (null)       g007       RUNNING
   8658425   8536946_393         None        (null)       g002       RUNNING

This strikes me as a bit odd.

Cheers,

Loris

Thekla Loizou  writes:


Dear Loris,

Thank you for your reply. To be honest, I don't believe there is anything wrong
with the job configuration or the node configuration.

I have just submitted a simple sleep script:

#!/bin/bash

sleep 10

as below:

sbatch --array=1-10 --ntasks-per-node=40 --time=09:00:00 test.sh

and squeue shows:

   131799_1   cpu  test.sh   thekla PD N/A  1  cn04 (Priority)
   131799_2   cpu  test.sh   thekla PD N/A  1  cn04 (Priority)
   131799_3   cpu  test.sh   thekla PD N/A  1  cn04 (Priority)
   131799_4   cpu  test.sh   thekla PD N/A  1  cn04 (Priority)
   131799_5   cpu  test.sh   thekla PD N/A  1  cn04 (Priority)
   131799_6   cpu  test.sh   thekla PD N/A  1  cn04 (Priority)
   131799_7   cpu  test.sh   thekla PD N/A  1  cn04 (Priority)
   131799_8   cpu  test.sh   thekla PD N/A  1  cn04 (Priority)
   131799_9   cpu  test.sh   thekla PD N/A  1  cn04 (Priority)
  131799_10   cpu  test.sh   thekla PD N/A  1  cn04 (Priority)

All of the jobs seem to be scheduled on node cn04.

When they start running they run on separate nodes:

       131799_1   cpu  test.sh   thekla  R   0:02 1 cn01
   131799_2   cpu  test.sh   thekla  R   0:02 1 cn02
   131799_3   cpu  test.sh   thekla  R   0:02 1 cn03
   131799_4   cpu  test.sh   thekla  R   0:02 1 cn04

Regards,

Thekla

On 7/12/21 5:17 PM, Loris Bennett wrote:

Dear Thekla,

Thekla Loizou  writes:


Dear Loris,

There is no specific node required for this array. I can verify that from
"scontrol show job 124841" since the requested node list is empty:
ReqNodeList=(null)

Also, all 17 nodes of the cluster are identical so all nodes fulfill the job
requirements, not only node cn06.

By "saving" the other nodes I mean that the scheduler estimates that the array
jobs will start on 2021-12-11T03:58:00, and no other jobs are scheduled to run
on the other nodes during that time. So it seems that the scheduler does
schedule the array jobs on more than one node, but this is not shown in the
squeue or scontrol output.

My guess is that there is something wrong with either the job
configuration or the node configuration, if Slurm thinks 9 jobs which
each require a whole node can all be started simultaneously on the same node.

Cheers,

Loris


Regards,

Thekla


On 7/12/21 12:16 PM, Loris Bennett wrote:

Hi Thekla,

Thekla Loizou  writes:


Dear all,

I have noticed that SLURM schedules several jobs from a job array on the same
node with the same start time and end time.

Each of these jobs requires the full node. You can see the squeue output below:

             JOBID  PARTITION  ST  START_TIME           NODES  SCHEDNODES  NODELIST(REASON)

     124841_1       cpu        PD  2021-12-11T03:58:00  1      cn06        (Priority)
     124841_2       cpu        PD  2021-12-11T03:58:00  1      cn06        (Priority)
     124841_3       cpu        PD

Re: [slurm-users] Job array start time and SchedNodes

2021-12-09 Thread Thekla Loizou

Dear Loris,

Thank you for your reply. To be honest, I don't believe there is anything 
wrong with the job configuration or the node configuration.


I have just submitted a simple sleep script:

#!/bin/bash

sleep 10

as below:

sbatch --array=1-10 --ntasks-per-node=40 --time=09:00:00 test.sh

and squeue shows:

  131799_1   cpu  test.sh   thekla PD N/A  1  cn04 (Priority)
  131799_2   cpu  test.sh   thekla PD N/A  1  cn04 (Priority)
  131799_3   cpu  test.sh   thekla PD N/A  1  cn04 (Priority)
  131799_4   cpu  test.sh   thekla PD N/A  1  cn04 (Priority)
  131799_5   cpu  test.sh   thekla PD N/A  1  cn04 (Priority)
  131799_6   cpu  test.sh   thekla PD N/A  1  cn04 (Priority)
  131799_7   cpu  test.sh   thekla PD N/A  1  cn04 (Priority)
  131799_8   cpu  test.sh   thekla PD N/A  1  cn04 (Priority)
  131799_9   cpu  test.sh   thekla PD N/A  1  cn04 (Priority)
 131799_10   cpu  test.sh   thekla PD N/A  1  cn04 (Priority)


All of the jobs seem to be scheduled on node cn04.

When they start running they run on separate nodes:

      131799_1   cpu  test.sh   thekla  R   0:02 1 cn01
  131799_2   cpu  test.sh   thekla  R   0:02 1 cn02
  131799_3   cpu  test.sh   thekla  R   0:02 1 cn03
  131799_4   cpu  test.sh   thekla  R   0:02 1 cn04

Regards,

Thekla

On 7/12/21 5:17 PM, Loris Bennett wrote:

Dear Thekla,

Thekla Loizou  writes:


Dear Loris,

There is no specific node required for this array. I can verify that from
"scontrol show job 124841" since the requested node list is empty:
ReqNodeList=(null)

Also, all 17 nodes of the cluster are identical so all nodes fulfill the job
requirements, not only node cn06.

By "saving" the other nodes I mean that the scheduler estimates that the array
jobs will start on 2021-12-11T03:58:00, and no other jobs are scheduled to run
on the other nodes during that time. So it seems that the scheduler does
schedule the array jobs on more than one node, but this is not shown in the
squeue or scontrol output.

My guess is that there is something wrong with either the job
configuration or the node configuration, if Slurm thinks 9 jobs which
each require a whole node can all be started simultaneously on the same node.

Cheers,

Loris


Regards,

Thekla


On 7/12/21 12:16 PM, Loris Bennett wrote:

Hi Thekla,

Thekla Loizou  writes:


Dear all,

I have noticed that SLURM schedules several jobs from a job array on the same
node with the same start time and end time.

Each of these jobs requires the full node. You can see the squeue output below:

             JOBID  PARTITION  ST  START_TIME           NODES  SCHEDNODES  NODELIST(REASON)

     124841_1       cpu        PD  2021-12-11T03:58:00  1      cn06        (Priority)
     124841_2       cpu        PD  2021-12-11T03:58:00  1      cn06        (Priority)
     124841_3       cpu        PD  2021-12-11T03:58:00  1      cn06        (Priority)
     124841_4       cpu        PD  2021-12-11T03:58:00  1      cn06        (Priority)
     124841_5       cpu        PD  2021-12-11T03:58:00  1      cn06        (Priority)
     124841_6       cpu        PD  2021-12-11T03:58:00  1      cn06        (Priority)
     124841_7       cpu        PD  2021-12-11T03:58:00  1      cn06        (Priority)
     124841_8       cpu        PD  2021-12-11T03:58:00  1      cn06        (Priority)
     124841_9       cpu        PD  2021-12-11T03:58:00  1      cn06        (Priority)

Is this a bug or am I missing something? Is this because the jobs have the same
JOBID and are still in the pending state? I am aware that the jobs will not
actually all run on the same node at the same time, and that the scheduler
somehow takes into account that this job array has 9 jobs that will need 9
nodes. I am creating a timeline with the start times of all jobs, and when the
job array jobs are due to start, no other jobs are set to run on the remaining
nodes (so the scheduler "saves" the other nodes for the jobs of the array, even
though squeue and scontrol show them all scheduled on the same node).

In general jobs from an array will be scheduled on whatever nodes
fulfil their requirements.  The fact that all the jobs have

cn06

as NODELIST however seems to suggest that you have either specified cn06
as the node the jobs should run on, or cn06 is the only node which
fulfils the job requirements.

I'm not sure what you mean about '"saving" the other nodes'.

Cheers,

Loris





Re: [slurm-users] Job array start time and SchedNodes

2021-12-07 Thread Thekla Loizou

Dear Loris,

There is no specific node required for this array. I can verify that 
from "scontrol show job 124841" since the requested node list is empty: 
ReqNodeList=(null)


Also, all 17 nodes of the cluster are identical so all nodes fulfill the 
job requirements, not only node cn06.


By "saving" the other nodes I mean that the scheduler estimates that the 
array jobs will start on 2021-12-11T03:58:00, and no other jobs are 
scheduled to run on the other nodes during that time. So it seems that 
the scheduler does schedule the array jobs on more than one node, but 
this is not shown in the squeue or scontrol output.
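(As a side note, in case it helps to reproduce this: the estimated start 
times and scheduled nodes of the pending array tasks can also be listed 
directly with something like

squeue --start -j 124841

which prints a START_TIME and SCHEDNODES column for each pending job.)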


Regards,

Thekla


On 7/12/21 12:16 PM, Loris Bennett wrote:

Hi Thekla,

Thekla Loizou  writes:


Dear all,

I have noticed that SLURM schedules several jobs from a job array on the same
node with the same start time and end time.

Each of these jobs requires the full node. You can see the squeue output below:

            JOBID  PARTITION  ST  START_TIME           NODES  SCHEDNODES  NODELIST(REASON)

    124841_1       cpu        PD  2021-12-11T03:58:00  1      cn06        (Priority)
    124841_2       cpu        PD  2021-12-11T03:58:00  1      cn06        (Priority)
    124841_3       cpu        PD  2021-12-11T03:58:00  1      cn06        (Priority)
    124841_4       cpu        PD  2021-12-11T03:58:00  1      cn06        (Priority)
    124841_5       cpu        PD  2021-12-11T03:58:00  1      cn06        (Priority)
    124841_6       cpu        PD  2021-12-11T03:58:00  1      cn06        (Priority)
    124841_7       cpu        PD  2021-12-11T03:58:00  1      cn06        (Priority)
    124841_8       cpu        PD  2021-12-11T03:58:00  1      cn06        (Priority)
    124841_9       cpu        PD  2021-12-11T03:58:00  1      cn06        (Priority)

Is this a bug or am I missing something? Is this because the jobs have the same
JOBID and are still in the pending state? I am aware that the jobs will not
actually all run on the same node at the same time, and that the scheduler
somehow takes into account that this job array has 9 jobs that will need 9
nodes. I am creating a timeline with the start times of all jobs, and when the
job array jobs are due to start, no other jobs are set to run on the remaining
nodes (so the scheduler "saves" the other nodes for the jobs of the array, even
though squeue and scontrol show them all scheduled on the same node).

In general jobs from an array will be scheduled on whatever nodes
fulfil their requirements.  The fact that all the jobs have

   cn06

as NODELIST however seems to suggest that you have either specified cn06
as the node the jobs should run on, or cn06 is the only node which
fulfils the job requirements.

I'm not sure what you mean about '"saving" the other nodes'.

Cheers,

Loris





[slurm-users] Job array start time and SchedNodes

2021-12-07 Thread Thekla Loizou

Dear all,

I have noticed that SLURM schedules several jobs from a job array on the 
same node with the same start time and end time.


Each of these jobs requires the full node. You can see the squeue output 
below:


          JOBID  PARTITION  ST  START_TIME           NODES  SCHEDNODES  NODELIST(REASON)

  124841_1       cpu        PD  2021-12-11T03:58:00  1      cn06        (Priority)
  124841_2       cpu        PD  2021-12-11T03:58:00  1      cn06        (Priority)
  124841_3       cpu        PD  2021-12-11T03:58:00  1      cn06        (Priority)
  124841_4       cpu        PD  2021-12-11T03:58:00  1      cn06        (Priority)
  124841_5       cpu        PD  2021-12-11T03:58:00  1      cn06        (Priority)
  124841_6       cpu        PD  2021-12-11T03:58:00  1      cn06        (Priority)
  124841_7       cpu        PD  2021-12-11T03:58:00  1      cn06        (Priority)
  124841_8       cpu        PD  2021-12-11T03:58:00  1      cn06        (Priority)
  124841_9       cpu        PD  2021-12-11T03:58:00  1      cn06        (Priority)


Is this a bug or am I missing something? Is this because the jobs have 
the same JOBID and are still in the pending state? I am aware that the 
jobs will not actually all run on the same node at the same time, and 
that the scheduler somehow takes into account that this job array has 
9 jobs that will need 9 nodes. I am creating a timeline with the start 
times of all jobs, and when the job array jobs are due to start, no 
other jobs are set to run on the remaining nodes (so the scheduler 
"saves" the other nodes for the jobs of the array, even though squeue 
and scontrol show them all scheduled on the same node).


Regards,
Thekla Loizou
HPC Systems Engineer
The Cyprus Institute



Re: [slurm-users] Building SLURM with X11 support

2021-05-31 Thread Thekla Loizou

Hi all,

I managed to solve the issue...xauth was missing from the compute nodes!

I increased the debug level in the slurmd logging and finally figured it 
out.
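For anyone hitting the same thing, on our CentOS 7 nodes this boiled down 
to roughly the following (the package name is the one from Marcus' list 
below; the debug level is only an example):

# on every compute node
yum install xorg-x11-xauth

# temporarily, in slurm.conf, followed by a restart of slurmd
SlurmdDebug=debug2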


Thanks again for your help,

Thekla

On 28/5/21 2:15 PM, Marcus Boden wrote:

Hi Thekla,

these are all the installed packages with x11 in the name:
libX11
libX11-common
libX11-devel
xorg-x11-apps
xorg-x11-drv-intel
xorg-x11-fonts-Type1
xorg-x11-font-utils
xorg-x11-proto-devel
xorg-x11-server-common
xorg-x11-server-utils
xorg-x11-server-Xorg
xorg-x11-utils
xorg-x11-xauth
xorg-x11-xbitmaps
xorg-x11-xinit
xorg-x11-xkb-utils

Hope that helps!
Marcus


On 28.05.21 13:01, Thekla Loizou wrote:

Dear Marcus,

Thanks a lot once again for the reply.

Would it be easy for you to tell me which X development libraries you 
have on your system?


I cannot find anything relevant in the configure script...

Thanks,

Thekla

On 28/5/21 12:56 PM, Marcus Boden wrote:
I have the same in our config.log and the x11 forwarding works fine. 
No other lines around it (about some failing checks or something), 
just this:


[...]
configure:22134: WARNING: unable to locate rrdtool installation
configure:22176: support for ucx disabled
configure:22296: checking whether Slurm internal X11 support is enabled
configure:22311: result:
configure:22350: checking for check >= 0.9.8
[...]

Best,
Marcus


On 28.05.21 09:26, Bjørn-Helge Mevik wrote:

Thekla Loizou  writes:


Also, when compiling SLURM, in the config.log I get:

configure:22291: checking whether Slurm internal X11 support is enabled
configure:22306: result:

The result is empty. I read that X11 is built by default, so I don't
expect a special flag to be needed at compile time, right?


My guess is that some X development library is missing. Perhaps look in 
the configure script for how this test was done (typically it will try 
to compile something with those devel libraries, and fail). Then see 
which package contains that library, install it and try again.
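On CentOS 7 that could look roughly like this (libX11-devel is only an 
illustrative guess at the missing piece):

# find which package provides the library the configure test needs
yum provides '*/libX11.so'

# then install the matching -devel package and re-run configure
yum install libX11-devel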









Re: [slurm-users] Building SLURM with X11 support

2021-05-28 Thread Thekla Loizou

Thank you both for your replies.

Our OS is CentOS 7.7. We have the dependencies installed and also the 
PrologFlags=X11 in the slurm.conf.


Perhaps I am missing some X11 packages? But X11 is working outside SLURM.

When getting interactive access on a node, I basically get:

salloc -N1 --x11
salloc: Granted job allocation 4694
salloc: Waiting for resource configuration
salloc: Job allocation 4694 has been revoked.
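(For reference, whether the flag actually made it into the running 
configuration can be checked with e.g.

scontrol show config | grep -i PrologFlags

which should report X11 among the flags.)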

Also, when compiling SLURM, in the config.log I get:

configure:22291: checking whether Slurm internal X11 support is enabled
configure:22306: result:

The result is empty. I read that X11 is built by default, so I don't 
expect a special flag to be needed at compile time, right?



Thanks,

Thekla

On 27/5/21 3:23 PM, Ole Holm Nielsen wrote:

 On 5/27/21 2:07 PM, Thekla Loizou wrote:

I am trying to use X11 forwarding in SLURM with no success.

We are installing SLURM using RPMs that we generate with the command 
"rpmbuild -ta slurm*.tar.bz2" as per the documentation.


I am currently working with SLURM version 20.11.7-1.

What am I missing when it comes to building SLURM with X11 enabled? 
Which flags and packages are required?


What is your OS?  Do you have X11 installed?

Did you install all Slurm prerequisites?  For CentOS 7 it is:

yum install rpm-build gcc openssl openssl-devel libssh2-devel 
pam-devel numactl numactl-devel hwloc hwloc-devel lua lua-devel 
readline-devel rrdtool-devel ncurses-devel gtk2-devel libssh2-devel 
libibmad libibumad perl-Switch perl-ExtUtils-MakeMaker


see 
https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#install-prerequisites


I hope this helps.

/Ole





[slurm-users] Building SLURM with X11 support

2021-05-27 Thread Thekla Loizou

Dear all,

I am trying to use X11 forwarding in SLURM with no success.

We are installing SLURM using RPMs that we generate with the command 
"rpmbuild -ta slurm*.tar.bz2" as per the documentation.


I am currently working with SLURM version 20.11.7-1.

What am I missing when it comes to building SLURM with X11 enabled? Which 
flags and packages are required?


Regards,

Thekla