Re: [slurm-users] Add partition to existing user association
Hi all,

I agree with both Marcus and Loris. I am referring to modifying the association, basically, since when we first created our associations we had only "user" and "account", and now we also need to add the "partition". My understanding from the documentation was that I would be able to modify the association and add the partition, but it seems that this is not the case. So I guess we will proceed with my original solution: delete all associations consisting of "user" and "account" and create new ones consisting of "user", "account" and "partition".

Regards,
Thekla

On 24/1/22 3:54 p.m., Marcus Wagner wrote:

Hi all,

an association is a triple (a quadruple if you have several clusters) consisting of "user", "account" and "partition". So you need to add an association. I'm not sure how the accounting works if no partition is set. We always set that triple, automatically, during the first submission to it. Not all users / accounts are allowed to use all partitions. This is checked externally, and if the user is allowed to submit to a partition with a specific account, we add that triple with sacctmgr.

Best,
Marcus

On 24.01.2022 at 14:38, Loris Bennett wrote:

Dear Thekla,

Disclaimer: firstly, I find account management in Slurm confusing and the documentation strangely unenlightening. Secondly, I don't make many changes to things once users have been set up, so I have very little experience of actually tweaking the accounting.

Despite my understanding of the documentation that you *can* modify the partition of a user, I don't think this is actually the case. If I look at the database, the user table has no column 'partition', whereas the association table does. So you might be able to modify the association, but you might also just have to delete the association and recreate it with the desired partitions. Or you might have to do something entirely different ...

Maybe people who do understand Slurm's account management can chip in.

Cheers,
Loris

Thekla Loizou writes:

Dear Dori,

Thanks for your reply. Unfortunately this does not work either...

Best,
Thekla

On 21/1/22 7:43 p.m., Dori Sajdak wrote:

Hi Thekla,

When it comes to partitions, I believe you need to specify the cluster, so in your example:

  sacctmgr modify user thekla account=ops set partition=gpu where cluster=YourClusterName

QOS is not tied to a specific cluster, but partitions are. That should work for you.

Dori

***
Dori Sajdak (she/her/hers)
Senior Systems Administrator
Center for Computational Research
University at Buffalo, State University of New York
701 Ellicott St
Buffalo, New York 14203
Phone: (716) 881-8934
Fax: (716) 849-6656
Web: http://buffalo.edu/ccr
Help Desk: https://ubccr.freshdesk.com
Twitter: https://twitter.com/ubccr
***

-----Original Message-----
From: slurm-users On Behalf Of Thekla Loizou
Sent: Friday, January 21, 2022 9:12 AM
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] Add partition to existing user association

Dear all,

I was wondering if there is a way to add a partition to an existing user association. For example, if I have an association of user thekla to an account ops, I can set a QOS for the existing association:

  sacctmgr modify user thekla account=ops set qos=nosubmit
   Modified user associations...
    C = cyclamen    A = ops    U = thekla

However, I cannot set a partition:

  sacctmgr modify user thekla account=ops set partition=gpu
   Unknown option: partition=gpu
   Use keyword 'where' to modify condition

Is this not possible? The only solution I found is to delete the association and create it again with the partition:

  sacctmgr del user thekla account=ops
  sacctmgr add user thekla account=ops partition=gpu

Thank you,
Thekla
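A minimal sketch of the delete-and-recreate approach described above, assuming a bash shell with sacctmgr on the path; the account "ops" and partition "gpu" come from the thread, while the user list is a placeholder. Deleting an association can affect limits and fairshare history, so test on a non-production cluster first:

  #!/bin/bash
  # Recreate user/account associations so that they include a partition.
  # Illustrative sketch only; the user names below are placeholders.
  ACCOUNT=ops
  PARTITION=gpu
  for U in thekla user2 user3; do
      # remove the old user+account association (-i commits without prompting) ...
      sacctmgr -i delete user "$U" account="$ACCOUNT"
      # ... and recreate it with the partition included
      sacctmgr -i add user "$U" account="$ACCOUNT" partition="$PARTITION"
  done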
Re: [slurm-users] Add partition to existing user association
Dear Dori,

Thanks for your reply. Unfortunately this does not work either...

Best,
Thekla

On 21/1/22 7:43 p.m., Dori Sajdak wrote:

Hi Thekla,

When it comes to partitions, I believe you need to specify the cluster, so in your example:

  sacctmgr modify user thekla account=ops set partition=gpu where cluster=YourClusterName

QOS is not tied to a specific cluster, but partitions are. That should work for you.

Dori

***
Dori Sajdak (she/her/hers)
Senior Systems Administrator
Center for Computational Research
University at Buffalo, State University of New York
701 Ellicott St
Buffalo, New York 14203
Phone: (716) 881-8934
Fax: (716) 849-6656
Web: http://buffalo.edu/ccr
Help Desk: https://ubccr.freshdesk.com
Twitter: https://twitter.com/ubccr
***

-----Original Message-----
From: slurm-users On Behalf Of Thekla Loizou
Sent: Friday, January 21, 2022 9:12 AM
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] Add partition to existing user association

Dear all,

I was wondering if there is a way to add a partition to an existing user association. For example, if I have an association of user thekla to an account ops, I can set a QOS for the existing association:

  sacctmgr modify user thekla account=ops set qos=nosubmit
   Modified user associations...
    C = cyclamen    A = ops    U = thekla

However, I cannot set a partition:

  sacctmgr modify user thekla account=ops set partition=gpu
   Unknown option: partition=gpu
   Use keyword 'where' to modify condition

Is this not possible? The only solution I found is to delete the association and create it again with the partition:

  sacctmgr del user thekla account=ops
  sacctmgr add user thekla account=ops partition=gpu

Thank you,
Thekla
[slurm-users] Add partition to existing user association
Dear all,

I was wondering if there is a way to add a partition to an existing user association. For example, if I have an association of user thekla to an account ops, I can set a QOS for the existing association:

  sacctmgr modify user thekla account=ops set qos=nosubmit
   Modified user associations...
    C = cyclamen    A = ops    U = thekla

However, I cannot set a partition:

  sacctmgr modify user thekla account=ops set partition=gpu
   Unknown option: partition=gpu
   Use keyword 'where' to modify condition

Is this not possible? The only solution I found is to delete the association and create it again with the partition:

  sacctmgr del user thekla account=ops
  sacctmgr add user thekla account=ops partition=gpu

Thank you,
Thekla
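As a quick check before deleting anything, the existing associations and their partition column can be listed with sacctmgr; a small sketch using the user and account names from the thread:

  # list existing associations for the user, including the partition column
  sacctmgr show assoc where user=thekla account=ops format=cluster,account,user,partition,qos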
Re: [slurm-users] Job array start time and SchedNodes
Dear Loris,

Yes, it is indeed a bit odd. At least now I know that this is how SLURM behaves and not something that has to do with our configuration.

Regards,
Thekla

On 9/12/21 1:04 p.m., Loris Bennett wrote:

Dear Thekla,

Yes, I think you are right. I have found a similar job on my system and this does seem to be the normal, slightly confusing behaviour. It looks as if the pending elements of the array get assigned a single node, but then start on other nodes:

  $ squeue -j 8536946 -O jobid,jobarrayid,reason,schednodes,nodelist,state | head
  JOBID     JOBID              REASON     SCHEDNODES  NODELIST  STATE
  8536946   8536946_[401-899]  Resources  g002                  PENDING
  8658719   8536946_400        None       (null)      g006      RUNNING
  8658685   8536946_399        None       (null)      g012      RUNNING
  8658625   8536946_398        None       (null)      g001      RUNNING
  8658491   8536946_397        None       (null)      g006      RUNNING
  8658428   8536946_396        None       (null)      g003      RUNNING
  8658427   8536946_395        None       (null)      g003      RUNNING
  8658426   8536946_394        None       (null)      g007      RUNNING
  8658425   8536946_393        None       (null)      g002      RUNNING

This strikes me as a bit odd.

Cheers,
Loris

Thekla Loizou writes:

Dear Loris,

Thank you for your reply. I don't believe that there is something wrong with the job configuration or the node configuration, to be honest. I have just submitted a simple sleep script:

  #!/bin/bash
  sleep 10

as below:

  sbatch --array=1-10 --ntasks-per-node=40 --time=09:00:00 test.sh

and squeue shows:

  131799_1   cpu  test.sh  thekla  PD  N/A  1  cn04  (Priority)
  131799_2   cpu  test.sh  thekla  PD  N/A  1  cn04  (Priority)
  131799_3   cpu  test.sh  thekla  PD  N/A  1  cn04  (Priority)
  131799_4   cpu  test.sh  thekla  PD  N/A  1  cn04  (Priority)
  131799_5   cpu  test.sh  thekla  PD  N/A  1  cn04  (Priority)
  131799_6   cpu  test.sh  thekla  PD  N/A  1  cn04  (Priority)
  131799_7   cpu  test.sh  thekla  PD  N/A  1  cn04  (Priority)
  131799_8   cpu  test.sh  thekla  PD  N/A  1  cn04  (Priority)
  131799_9   cpu  test.sh  thekla  PD  N/A  1  cn04  (Priority)
  131799_10  cpu  test.sh  thekla  PD  N/A  1  cn04  (Priority)

All of the jobs seem to be scheduled on node cn04. When they start running, they run on separate nodes:

  131799_1   cpu  test.sh  thekla  R  0:02  1  cn01
  131799_2   cpu  test.sh  thekla  R  0:02  1  cn02
  131799_3   cpu  test.sh  thekla  R  0:02  1  cn03
  131799_4   cpu  test.sh  thekla  R  0:02  1  cn04

Regards,
Thekla

On 7/12/21 5:17 p.m., Loris Bennett wrote:

Dear Thekla,

Thekla Loizou writes:

Dear Loris,

There is no specific node required for this array. I can verify that from "scontrol show job 124841", since the requested node list is empty:

  ReqNodeList=(null)

Also, all 17 nodes of the cluster are identical, so all nodes fulfill the job requirements, not only node cn06.

By "saving" the other nodes I mean that the scheduler estimates that the array jobs will start on 2021-12-11T03:58:00. No other jobs are scheduled to run during that time on the other nodes. So it seems that somehow the scheduler schedules the array jobs on more than one node, but this is not showing in the squeue or scontrol output.

My guess is that there is something wrong with either the job configuration or the node configuration, if Slurm thinks 9 jobs which require a whole node can all be started simultaneously on the same node.

Cheers,
Loris

Regards,
Thekla

On 7/12/21 12:16 p.m., Loris Bennett wrote:

Hi Thekla,

Thekla Loizou writes:

Dear all,

I have noticed that SLURM schedules several jobs from a job array on the same node with the same start time and end time. Each of these jobs requires the full node. You can see the squeue output below:

  JOBID      PARTITION  ST  START_TIME           NODES  SCHEDNODES  NODELIST(REASON)
  124841_1   cpu        PD  2021-12-11T03:58:00  1      cn06        (Priority)
  124841_2   cpu        PD  2021-12-11T03:58:00  1      cn06        (Priority)
  124841_3   cpu        PD
Re: [slurm-users] Job array start time and SchedNodes
Dear Loris,

Thank you for your reply. I don't believe that there is something wrong with the job configuration or the node configuration, to be honest. I have just submitted a simple sleep script:

  #!/bin/bash
  sleep 10

as below:

  sbatch --array=1-10 --ntasks-per-node=40 --time=09:00:00 test.sh

and squeue shows:

  131799_1   cpu  test.sh  thekla  PD  N/A  1  cn04  (Priority)
  131799_2   cpu  test.sh  thekla  PD  N/A  1  cn04  (Priority)
  131799_3   cpu  test.sh  thekla  PD  N/A  1  cn04  (Priority)
  131799_4   cpu  test.sh  thekla  PD  N/A  1  cn04  (Priority)
  131799_5   cpu  test.sh  thekla  PD  N/A  1  cn04  (Priority)
  131799_6   cpu  test.sh  thekla  PD  N/A  1  cn04  (Priority)
  131799_7   cpu  test.sh  thekla  PD  N/A  1  cn04  (Priority)
  131799_8   cpu  test.sh  thekla  PD  N/A  1  cn04  (Priority)
  131799_9   cpu  test.sh  thekla  PD  N/A  1  cn04  (Priority)
  131799_10  cpu  test.sh  thekla  PD  N/A  1  cn04  (Priority)

All of the jobs seem to be scheduled on node cn04. When they start running, they run on separate nodes:

  131799_1   cpu  test.sh  thekla  R  0:02  1  cn01
  131799_2   cpu  test.sh  thekla  R  0:02  1  cn02
  131799_3   cpu  test.sh  thekla  R  0:02  1  cn03
  131799_4   cpu  test.sh  thekla  R  0:02  1  cn04

Regards,
Thekla

On 7/12/21 5:17 p.m., Loris Bennett wrote:

Dear Thekla,

Thekla Loizou writes:

Dear Loris,

There is no specific node required for this array. I can verify that from "scontrol show job 124841", since the requested node list is empty:

  ReqNodeList=(null)

Also, all 17 nodes of the cluster are identical, so all nodes fulfill the job requirements, not only node cn06.

By "saving" the other nodes I mean that the scheduler estimates that the array jobs will start on 2021-12-11T03:58:00. No other jobs are scheduled to run during that time on the other nodes. So it seems that somehow the scheduler schedules the array jobs on more than one node, but this is not showing in the squeue or scontrol output.

My guess is that there is something wrong with either the job configuration or the node configuration, if Slurm thinks 9 jobs which require a whole node can all be started simultaneously on the same node.

Cheers,
Loris

Regards,
Thekla

On 7/12/21 12:16 p.m., Loris Bennett wrote:

Hi Thekla,

Thekla Loizou writes:

Dear all,

I have noticed that SLURM schedules several jobs from a job array on the same node with the same start time and end time. Each of these jobs requires the full node. You can see the squeue output below:

  JOBID      PARTITION  ST  START_TIME           NODES  SCHEDNODES  NODELIST(REASON)
  124841_1   cpu        PD  2021-12-11T03:58:00  1      cn06        (Priority)
  124841_2   cpu        PD  2021-12-11T03:58:00  1      cn06        (Priority)
  124841_3   cpu        PD  2021-12-11T03:58:00  1      cn06        (Priority)
  124841_4   cpu        PD  2021-12-11T03:58:00  1      cn06        (Priority)
  124841_5   cpu        PD  2021-12-11T03:58:00  1      cn06        (Priority)
  124841_6   cpu        PD  2021-12-11T03:58:00  1      cn06        (Priority)
  124841_7   cpu        PD  2021-12-11T03:58:00  1      cn06        (Priority)
  124841_8   cpu        PD  2021-12-11T03:58:00  1      cn06        (Priority)
  124841_9   cpu        PD  2021-12-11T03:58:00  1      cn06        (Priority)

Is this a bug or am I missing something? Is this because the jobs have the same JOBID and are still in the pending state? I am aware that the jobs will not actually all run on the same node at the same time and that the scheduler somehow takes into account that this job array has 9 jobs that will need 9 nodes. I am creating a timeline with the start time of all jobs, and when the job array jobs start running no other jobs are set to run on the remaining nodes (so it "saves" the other nodes for the jobs of the array, even if they are all scheduled to run on the same node based on squeue or scontrol).

In general, jobs from an array will be scheduled on whatever nodes fulfil their requirements. The fact that all the jobs have cn06 as NODELIST, however, seems to suggest that you have either specified cn06 as the node the jobs should run on, or cn06 is the only node which fulfils the job requirements. I'm not sure what you mean about '"saving" the other nodes'.

Cheers,
Loris
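A compact way to reproduce the behaviour discussed above on a test system, assuming sbatch and squeue are available; the task count per node and the array size are placeholders and should be sized so that each array element fills a whole node on your cluster:

  # submit a small array of whole-node sleep jobs (placeholder sizes)
  sbatch --array=1-10 --ntasks-per-node=40 --time=00:10:00 --wrap='sleep 10'

  # while pending, the array elements typically all show the same SchedNodes entry;
  # once running, they spread over different nodes
  squeue -u "$USER" -O jobid,jobarrayid,state,reason,schednodes,nodelist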
Re: [slurm-users] Job array start time and SchedNodes
Dear Loris,

There is no specific node required for this array. I can verify that from "scontrol show job 124841", since the requested node list is empty:

  ReqNodeList=(null)

Also, all 17 nodes of the cluster are identical, so all nodes fulfill the job requirements, not only node cn06.

By "saving" the other nodes I mean that the scheduler estimates that the array jobs will start on 2021-12-11T03:58:00. No other jobs are scheduled to run during that time on the other nodes. So it seems that somehow the scheduler schedules the array jobs on more than one node, but this is not showing in the squeue or scontrol output.

Regards,
Thekla

On 7/12/21 12:16 p.m., Loris Bennett wrote:

Hi Thekla,

Thekla Loizou writes:

Dear all,

I have noticed that SLURM schedules several jobs from a job array on the same node with the same start time and end time. Each of these jobs requires the full node. You can see the squeue output below:

  JOBID      PARTITION  ST  START_TIME           NODES  SCHEDNODES  NODELIST(REASON)
  124841_1   cpu        PD  2021-12-11T03:58:00  1      cn06        (Priority)
  124841_2   cpu        PD  2021-12-11T03:58:00  1      cn06        (Priority)
  124841_3   cpu        PD  2021-12-11T03:58:00  1      cn06        (Priority)
  124841_4   cpu        PD  2021-12-11T03:58:00  1      cn06        (Priority)
  124841_5   cpu        PD  2021-12-11T03:58:00  1      cn06        (Priority)
  124841_6   cpu        PD  2021-12-11T03:58:00  1      cn06        (Priority)
  124841_7   cpu        PD  2021-12-11T03:58:00  1      cn06        (Priority)
  124841_8   cpu        PD  2021-12-11T03:58:00  1      cn06        (Priority)
  124841_9   cpu        PD  2021-12-11T03:58:00  1      cn06        (Priority)

Is this a bug or am I missing something? Is this because the jobs have the same JOBID and are still in the pending state? I am aware that the jobs will not actually all run on the same node at the same time and that the scheduler somehow takes into account that this job array has 9 jobs that will need 9 nodes. I am creating a timeline with the start time of all jobs, and when the job array jobs start running no other jobs are set to run on the remaining nodes (so it "saves" the other nodes for the jobs of the array, even if they are all scheduled to run on the same node based on squeue or scontrol).

In general, jobs from an array will be scheduled on whatever nodes fulfil their requirements. The fact that all the jobs have cn06 as NODELIST, however, seems to suggest that you have either specified cn06 as the node the jobs should run on, or cn06 is the only node which fulfils the job requirements. I'm not sure what you mean about '"saving" the other nodes'.

Cheers,
Loris
[slurm-users] Job array start time and SchedNodes
Dear all,

I have noticed that SLURM schedules several jobs from a job array on the same node with the same start time and end time. Each of these jobs requires the full node. You can see the squeue output below:

  JOBID      PARTITION  ST  START_TIME           NODES  SCHEDNODES  NODELIST(REASON)
  124841_1   cpu        PD  2021-12-11T03:58:00  1      cn06        (Priority)
  124841_2   cpu        PD  2021-12-11T03:58:00  1      cn06        (Priority)
  124841_3   cpu        PD  2021-12-11T03:58:00  1      cn06        (Priority)
  124841_4   cpu        PD  2021-12-11T03:58:00  1      cn06        (Priority)
  124841_5   cpu        PD  2021-12-11T03:58:00  1      cn06        (Priority)
  124841_6   cpu        PD  2021-12-11T03:58:00  1      cn06        (Priority)
  124841_7   cpu        PD  2021-12-11T03:58:00  1      cn06        (Priority)
  124841_8   cpu        PD  2021-12-11T03:58:00  1      cn06        (Priority)
  124841_9   cpu        PD  2021-12-11T03:58:00  1      cn06        (Priority)

Is this a bug or am I missing something? Is this because the jobs have the same JOBID and are still in the pending state?

I am aware that the jobs will not actually all run on the same node at the same time and that the scheduler somehow takes into account that this job array has 9 jobs that will need 9 nodes. I am creating a timeline with the start time of all jobs, and when the job array jobs start running no other jobs are set to run on the remaining nodes (so it "saves" the other nodes for the jobs of the array, even if they are all scheduled to run on the same node based on squeue or scontrol).

Regards,
Thekla Loizou
HPC Systems Engineer
The Cyprus Institute
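One way to build the start-time timeline mentioned above is squeue's --start report, which shows the scheduler's expected start time and tentative nodes for pending jobs; a small sketch using the job ID from the thread:

  # report expected start time and tentative (scheduled) nodes for the pending array elements
  squeue --start -j 124841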
Re: [slurm-users] Building SLURM with X11 support
Hi all,

I managed to solve the issue... xauth was missing from the compute nodes! I increased the debug level in the slurmd logging and finally figured it out.

Thanks again for your help,
Thekla

On 28/5/21 2:15 p.m., Marcus Boden wrote:

Hi Thekla,

these are all the installed packages with x11 in the name:

  libX11
  libX11-common
  libX11-devel
  xorg-x11-apps
  xorg-x11-drv-intel
  xorg-x11-fonts-Type1
  xorg-x11-font-utils
  xorg-x11-proto-devel
  xorg-x11-server-common
  xorg-x11-server-utils
  xorg-x11-server-Xorg
  xorg-x11-utils
  xorg-x11-xauth
  xorg-x11-xbitmaps
  xorg-x11-xinit
  xorg-x11-xkb-utils

Hope that helps!
Marcus

On 28.05.21 13:01, Thekla Loizou wrote:

Dear Marcus,

Thanks a lot once again for the reply. Would it be easy for you to tell me which X development libraries you have on your system? I cannot find anything in the configure script...

Thanks,
Thekla

On 28/5/21 12:56 p.m., Marcus Boden wrote:

I have the same in our config.log and the X11 forwarding works fine. No other lines around it (about some failing checks or something), just this:

  [...]
  configure:22134: WARNING: unable to locate rrdtool installation
  configure:22176: support for ucx disabled
  configure:22296: checking whether Slurm internal X11 support is enabled
  configure:22311: result:
  configure:22350: checking for check >= 0.9.8
  [...]

Best,
Marcus

On 28.05.21 09:26, Bjørn-Helge Mevik wrote:

Thekla Loizou writes:

Also, when compiling SLURM, in the config.log I get:

  configure:22291: checking whether Slurm internal X11 support is enabled
  configure:22306: result:

The result is empty. I read that X11 is built by default, so I don't expect a special flag to be given at compilation time, right?

My guess is that some X development library is missing. Perhaps look in the configure script for how this test was done (typically it will try to compile something with those devel libraries, and fail). Then see which package contains that library, install it and try again.
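A minimal sketch of the two checks that led to the fix above, assuming CentOS 7 compute nodes (as stated later in the thread) and root access; run it on each compute node:

  # 1) verify xauth exists on the node (it was missing here);
  #    on CentOS 7 the package is xorg-x11-xauth
  command -v xauth || yum install -y xorg-x11-xauth

  # 2) while debugging, raise slurmd logging in slurm.conf, e.g.
  #      SlurmdDebug=debug2
  #    then restart slurmd on the node and watch its log (revert the level afterwards)
  systemctl restart slurmd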
Re: [slurm-users] Building SLURM with X11 support
Thank you both for your replies.

Our OS is CentOS 7.7. We have the dependencies installed and also PrologFlags=X11 in slurm.conf. Perhaps I am missing some X11 packages? But X11 is working outside SLURM.

When getting interactive access on a node, basically I get:

  salloc -N1 --x11
  salloc: Granted job allocation 4694
  salloc: Waiting for resource configuration
  salloc: Job allocation 4694 has been revoked.

Also, when compiling SLURM, in the config.log I get:

  configure:22291: checking whether Slurm internal X11 support is enabled
  configure:22306: result:

The result is empty. I read that X11 is built by default, so I don't expect a special flag to be given at compilation time, right?

Thanks,
Thekla

On 27/5/21 3:23 p.m., Ole Holm Nielsen wrote:

On 5/27/21 2:07 PM, Thekla Loizou wrote:

I am trying to use X11 forwarding in SLURM with no success. We are installing SLURM using RPMs that we generate with the command "rpmbuild -ta slurm*.tar.bz2", as per the documentation. I am currently working with SLURM version 20.11.7-1. What am I missing when it comes to building SLURM with X11 enabled? Which flags and packages are required?

What is your OS? Do you have X11 installed? Did you install all Slurm prerequisites? For CentOS 7 it is:

  yum install rpm-build gcc openssl openssl-devel libssh2-devel pam-devel numactl numactl-devel hwloc hwloc-devel lua lua-devel readline-devel rrdtool-devel ncurses-devel gtk2-devel libssh2-devel libibmad libibumad perl-Switch perl-ExtUtils-MakeMaker

see https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#install-prerequisites

I hope this helps.
/Ole
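Two quick checks that can help narrow this kind of problem down, assuming access to the build tree and a working scontrol; nothing here is specific to the thread beyond PrologFlags=X11:

  # confirm the running Slurm configuration actually has X11 enabled
  scontrol show config | grep -i PrologFlags      # should list X11

  # in the rpmbuild/configure tree, see what configure decided about internal X11 support
  grep -n -i "x11" config.log | head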
[slurm-users] Building SLURM with X11 support
Dear all,

I am trying to use X11 forwarding in SLURM with no success. We are installing SLURM using RPMs that we generate with the command "rpmbuild -ta slurm*.tar.bz2", as per the documentation. I am currently working with SLURM version 20.11.7-1.

What am I missing when it comes to building SLURM with X11 enabled? Which flags and packages are required?

Regards,
Thekla