[slurm-users] Re: Limit GPU depending on type

2024-06-14 Thread Gestió Servidors via slurm-users
Hi, because of my real scenario (in my first post I explained my testing scenario), with several different users of different types (researchers, university students and/or teachers, etc.), I have distributed my GPUs in 3 different partitions: * PartitionName=cuda-staff.q Nodes=gpu-[1-4]

[slurm-users] Limit GPU depending on type

2024-06-12 Thread Gestió Servidors via slurm-users
Hello, I would like to know if it would be possible to limit, using "sacctmgr", the use of a certain type of GPU according to the name I have assigned in the "gres.conf" file. For example, my small cluster has 3 GPU nodes with 2 GPUs each. Two of those GPUs are the same model but they are located i

[slurm-users] Re: Problems with gres.conf

2024-06-05 Thread Gestió Servidors via slurm-users
Hi, my GPU testing system (named “gpu-node”) is a simple computer with one socket and an "Intel(R) Core(TM) i7 CPU 950 @ 3.07GHz" processor. Running "lscpu", I can see there are 4 cores per socket, 2 threads per core and 8 CPUs: Architecture: x86_64 CPU op-mode(s):32-bit,

[slurm-users] Problems with gres.conf

2024-05-20 Thread Gestió Servidors via slurm-users
Hello, I am trying to rewrite my gres.conf file. Before the changes, the file looked like this: NodeName=node-gpu-1 AutoDetect=off Name=gpu Type=GeForceRTX2070 File=/dev/nvidia0 Cores=0-11 NodeName=node-gpu-1 AutoDetect=off Name=gpu Type=GeForceGTX1080Ti File=/dev/nvidia1 Cores=12-23 NodeName=no

[slurm-users] Invalid/incorrect gres.conf syntax

2024-05-20 Thread Gestió Servidors via slurm-users
Hello, I have configured my "gres.conf" in this way: NodeName=node-gpu-1 AutoDetect=off Name=gpu Type=GeForceRTX2070 File=/dev/nvidia0 Cores=0-11 NodeName=node-gpu-1 AutoDetect=off Name=gpu Type=GeForceGTX1080Ti File=/dev/nvidia1 Cores=12-23 NodeName=node-gpu-2 AutoDetect=off Name=gpu Type=GeFor

[slurm-users] Apply an specific QoS to all users that belongs to an specific account

2024-05-20 Thread Gestió Servidors via slurm-users
Hi, I would like to know if it is possible to apply a specific QoS to all users that belong to a specific account. For example, I have created some new users "user_XX" and I have also created their new accounts in SLURM with "sacctmgr create account name=Test" and "sacctmgr create user nam

[slurm-users] Problems with gres.conf

2024-05-09 Thread Gestió Servidors via slurm-users
Hello, I am trying to rewrite my gres.conf file. Before the changes, the file looked like this: NodeName=node-gpu-1 AutoDetect=off Name=gpu Type=GeForceRTX2070 File=/dev/nvidia0 Cores=0-11 NodeName=node-gpu-1 AutoDetect=off Name=gpu Type=GeForceGTX1080Ti File=/dev/nvidia1 Cores=12-23 NodeName=no

[slurm-users] Invalid/incorrect gres.conf syntax

2024-05-06 Thread Gestió Servidors via slurm-users
Hello, I have configured my "gres.conf" in this way: NodeName=node-gpu-1 AutoDetect=off Name=gpu Type=GeForceRTX2070 File=/dev/nvidia0 Cores=0-11 NodeName=node-gpu-1 AutoDetect=off Name=gpu Type=GeForceGTX1080Ti File=/dev/nvidia1 Cores=12-23 NodeName=node-gpu-2 AutoDetect=off Name=gpu Type=GeFor

[slurm-users] Apply an specific QoS to all users that belongs to an specific account

2024-04-23 Thread Gestió Servidors via slurm-users
Hi, I would like to know if it is possible to apply a specific QoS to all users that belong to a specific account. For example, I have created some new users "user_XX" and I have also created their new accounts in SLURM with "sacctmgr create account name=Test" and "sacctmgr create user nam

[slurm-users] Association limit problem

2024-04-17 Thread Gestió Servidors via slurm-users
Hello, I'm doing some tests with "associations" in "sacctmgr". I have created three users (user_1, user_2 and user_3). For each of these users, I have created an association: [root@myserver log]# sacctmgr show user user_1 --associations User Def Acct AdminClusterAccount Pa

[slurm-users] Re: Lua script

2024-03-21 Thread Gestió Servidors via slurm-users
Hello, answering the question I was asked: * What are the contents of your /etc/slurm/job_submit.lua file? function slurm_job_submit(job_desc, part_list, submit_uid) if (job_desc.user_id == 1008) then slurm.log_info("Trabajo sometido por druiz") if (job_d

[slurm-users] Re: Lua script

2024-03-20 Thread Gestió Servidors via slurm-users
Hello, after adding "EnforcePartLimits=ALL" to slurm.conf and restarting the slurmctld daemon, the job is still accepted... so I don't understand what I'm doing wrong. My slurm.conf is this: ControlMachine=my_server MailProg=/bin/mail MpiDefault=none ProctrackType=proctrack/linuxproc Return

[slurm-users] Re: Lua script

2024-03-06 Thread Gestió Servidors via slurm-users
And how can I reject the job inside the lua script? -- slurm-users mailing list -- slurm-users@lists.schedmd.com To unsubscribe send an email to slurm-users-le...@lists.schedmd.com

[slurm-users] Lua script

2024-03-06 Thread Gestió Servidors via slurm-users
Hello, I'm writing a small Lua script to modify the "TimeLimit" of a submitted job if the user has set a TimeLimit bigger than the one configured in the partition. So, if the TimeLimit for the partition is, for example, 4 hours (04:00:00) and a user submits his/her job with a TimeLimit of 5 hours, lua script

[slurm-users] Question about CPUs and cores

2024-01-25 Thread Gestió Servidors
Hi, I want to run a simple test that uses one node and four cores. Also, in my script, I execute a binary that reports which core each of the four tasks is running on. These are my files: * submit script: #!/bin/bash #SBATCH --job-name=test_jobs # Job name #SBATCH --output=test_jo

Re: [slurm-users] Slurm version 23.11 is now available

2023-11-24 Thread Gestió Servidors
Hello, Some days ago, I started to configure a new server with SLURM 23.02.5. Yesterday, I read in this mailing list that version 23.11.0 was released, so today I have compiled this latest version. However, after starting slurmdbd (with a database upgrade), I have got problems with slurmctld, b

[slurm-users] Configure a user as "admin" only in his/her account

2023-10-18 Thread Gestió Servidors
Hello, I would like to know if it is possible to configure a user as "admin" only for his/her account. For example, in my accounting tree I have an account called "students" with users "student-1", "student-2" and so on. In this account, there is a user called "teacher" that must have privileges to ca

[slurm-users] cpu-bind=MASK at output files

2023-06-27 Thread Gestió Servidors
Hello, Running this simple script: #!/bin/bash # #SBATCH --job-name=mega_job #SBATCH --output=mega_job.out #SBATCH --tasks=3 #SBATCH --array=0-5 #SBATCH --partition=cuda.q echo "STARTING" srun echo "hello world" >> file_${SLURM_ARRAY_TASK_ID}.out echo "ENDING" I always get this output: STARTING

[slurm-users] Question about CPU and core binding

2022-10-20 Thread Gestió Servidors
Hi, I have run two scripts that take 2 nodes and 8 tasks per node. The first script runs with "--distribution=block:block" and the second with "--distribution=cyclic:block". As far as I understand, in the first case, with "--distribution=block:block", the job has been executed in this way (and I think it
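The difference between the two modes can be sketched in a few lines of Python. This is a minimal illustration, assuming the 2-node, 8-tasks-per-node layout from the post; it models only the node-level (first) field of "--distribution" and ignores the socket-level ":block" part. The function names are hypothetical and are not part of SLURM.

```python
def node_of_task(rank: int, nodes: int, tasks_per_node: int, mode: str) -> int:
    """Return the index of the node that a task rank lands on.

    Only the node-level (first) field of --distribution is modelled here.
    """
    if mode == "block":
        # Consecutive ranks fill one node before moving to the next.
        return rank // tasks_per_node
    if mode == "cyclic":
        # Ranks are dealt out round-robin across the nodes.
        return rank % nodes
    raise ValueError(f"unsupported mode: {mode}")


def layout(nodes: int, tasks_per_node: int, mode: str) -> list[list[int]]:
    """List the task ranks placed on each node."""
    placement: list[list[int]] = [[] for _ in range(nodes)]
    for rank in range(nodes * tasks_per_node):
        placement[node_of_task(rank, nodes, tasks_per_node, mode)].append(rank)
    return placement


if __name__ == "__main__":
    for mode in ("block", "cyclic"):
        print(mode, layout(nodes=2, tasks_per_node=8, mode=mode))
```

With "block", ranks 0-7 land on the first node and 8-15 on the second; with "cyclic", even ranks go to the first node and odd ranks to the second.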

Re: [slurm-users] TimeLimit parameter

2021-12-03 Thread Gestió Servidors
Hi, Answering between lines... > Hi; > > The EnforcePartLimits parameter in slurm.conf, should be set to ALL or ANY > to enforce time limit for partition. > > Regards. > > Ahmet M. I have not configured "EnforcePartLimits" in my slurm.conf file, so I suppose that my SLURM is runnin

[slurm-users] TimeLimit parameter

2021-12-02 Thread Gestió Servidors
Hello, I'm going to describe a problem I have detected in my SLURM cluster. If I configure a partition with a "TimeLimit" of, for example, 15 minutes and, later, a user submits a job in which he/she requests a bigger "TimeLimit" (for example, 20 minutes), the job remains in PENDING state because TimeLimit reque

Re: [slurm-users] Error " slurm_receive_msg_and_forward: Zero Bytes were transmitted or received"

2021-12-01 Thread Gestió Servidors
Hi, I can't synchronize beforehand with "ntpdate" because when I run "ntpdate -s my_NTP_server", I only receive the message "ntpdate: no server suitable for synchronization found"... Thanks. -- Daniel Ruiz Molina Tècnic Mitjà Informàtic Arquitec

[slurm-users] Error " slurm_receive_msg_and_forward: Zero Bytes were transmitted or received"

2021-11-30 Thread Gestió Servidors
Hello, In recent days, my nodes have been showing the error "slurm_receive_msg_and_forward: Zero Bytes were transmitted or received". After reviewing the whole configuration, I have noticed that the problem is the time difference between nodes and server. If nodes are "badly" configured (time in the future or in the pa

[slurm-users] Information about finished jobs

2021-06-13 Thread Gestió Servidors
Hello, How can I get all the information about a finished job in the same way as "scontrol show jobid=" does when the job is pending or running? Thanks.

Re: [slurm-users] Job requesting two different GPUs on two

2021-06-11 Thread Gestió Servidors
Hi, I have tried with > > #!/bin/bash > # > #SBATCH --job-name=N2n4 > #SBATCH --partition=cuda.q > #SBATCH --output=N2n4-CUDA.txt > #SBATCH -N 1 # number of nodes with the first GPU > #SBATCH -n 2 # number of cores > #SBATCH --gres=gpu:GeForceRTX3080:1 > #SBATCH hetjob > #SBATCH -N 1 # number of

Re: [slurm-users] Job requesting two different GPUs on two

2021-06-10 Thread Gestió Servidors
Hello, No, with "#SBATCH --gres=gpu:2" SLURM searches for a node with 2 GPUs, but I need to run my job on 2 nodes using 2 GPUs, one GPU in each node. If both GPUs are the same, the job runs OK, but I want to test running my job on two nodes: one offers a GeForceRTX3080 and the second offers a GeForceRTX20

[slurm-users] Job requesting two different GPUs on two different nodes

2021-06-08 Thread Gestió Servidors
Hi, Today, doing some tests, I have not found a way to write a submit script that requests 2 different GPUs on 2 different nodes. With this simple script: #!/bin/bash # #SBATCH --job-name=N2n4 #SBATCH --output=N2n4-CUDA.txt #SBATCH --gres=gpu:GeForceRTX3080:1 #SBATCH -N 2 # number of nodes #

[slurm-users] Suspended and released job continues running in a "down" partition

2021-03-24 Thread Gestió Servidors
Hi, I have got this new question for you: In my cluster there is a running job. Then, I change a partition state from "up" to "down". That job continues "running" because it was already running before the state changed. Now, I explicitly run "scontrol suspend my_job". After it, my

Re: [slurm-users] MaxTime only for a user

2021-02-25 Thread Gestió Servidors
Hi, After configuring "MaxWallDurationPerJob" without any good result (a job with a long sleep keeps running although MaxWallDurationPerJob=1 (1 minute)), I have now tested a time_limit reconfiguration within a "lua" script. My "lua" script contains these lines: [...] if (job_desc.u

Re: [slurm-users] MaxTime only for a user

2021-02-25 Thread Gestió Servidors
Hi, I have tested with "sacctmgr modify user name=my_user set MaxWallDurationPerJob=01:00" (in other words, user “my_user” will have only 1 minute per job), but after that, I submitted a job as “my_user” with a “sleep” of 50 minutes and the job has NOT been cancelled… so something is wrong ☹
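Time values such as "01:00" are easy to misread, because a two-field value means minutes:seconds rather than hours:minutes. The sketch below is a minimal Python illustration of the time formats listed in the SLURM man pages ("minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes", "days-hours:minutes:seconds"). The function name is hypothetical and the code is not taken from SLURM itself.

```python
def slurm_time_to_seconds(text: str) -> int:
    """Convert a SLURM-style time string to seconds."""
    days = 0
    if "-" in text:
        # With a day prefix the remaining fields read hours[:minutes[:seconds]].
        day_part, rest = text.split("-", 1)
        days = int(day_part)
        parts = [int(p) for p in rest.split(":")]
        if len(parts) > 3:
            raise ValueError(f"too many fields: {text!r}")
        parts += [0] * (3 - len(parts))
        hours, minutes, seconds = parts
    else:
        parts = [int(p) for p in text.split(":")]
        if len(parts) == 1:
            # A bare number is a count of minutes.
            hours, minutes, seconds = 0, parts[0], 0
        elif len(parts) == 2:
            # Two fields are minutes:seconds, not hours:minutes.
            hours, minutes, seconds = 0, parts[0], parts[1]
        elif len(parts) == 3:
            hours, minutes, seconds = parts
        else:
            raise ValueError(f"too many fields: {text!r}")
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


if __name__ == "__main__":
    for value in ("01:00", "10:00", "04:00:00", "7-0"):
        print(value, "->", slurm_time_to_seconds(value), "seconds")
```

Under this reading, "01:00" is 60 seconds, which matches the "1 minute per job" interpretation given in the post.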

[slurm-users] MaxTime only for a user

2021-02-25 Thread Gestió Servidors
Hi, I need to configure a SLURM partition to allow jobs that need more than an hour, but only for a specific user. By default, that partition allows jobs with a "MaxTime=10:00" but, now, a user needs to run some tests in the same partition that will last about one hour. If I configure a "MaxTime"

[slurm-users] Fairshare tree after SLURM upgrade

2021-01-28 Thread Gestió Servidors
Hello, I'm going to upgrade my SLURM version from 17.11.5 to 19.05.1. I know this is not the latest version, but I also manage another cluster that is running this version. My question is: during the process, I need to upgrade "slurmdbd". All the fairshare tree (with rawusage, effectvusage, fai

Re: [slurm-users] Using "Environment Modules"

2021-01-26 Thread Gestió Servidors
Hi, My environment is this: * Users are using "bash" as the default shell * A sample of one of my environment modules is this: #%Module1.0 ## ## modules modulefile ## ## modulefiles/modules. Generated from modules.in by configure. ## set ModulesVersion "3.2.10" proc ModulesHelp {

[slurm-users] Using "Environment Modules" in a SLURM script

2021-01-22 Thread Gestió Servidors
Hello, I use "Environment Modules" (http://modules.sourceforge.net/) in my SLURM cluster. In my scripts I need to add an explicit "source /soft/modules-3.2.10/Modules/3.2.10/init/bash". However, in several examples of SLURM scripts I have read, nobody mentions that. So, have I forgotten a

[slurm-users] OpenMP job and not expected results

2021-01-22 Thread Gestió Servidors
Hello, I'm running this script in a cluster composed of 11 nodes, each one with 1 processor with 4 cores and 1 thread per core: #!/bin/bash #SBATCH --job-name=hellohybrid #SBATCH --output=hellohybrid.out #SBATCH --ntasks=4 #SBATCH --cpus-per-task=3 #SBATCH --partition=nodes # Load the default

[slurm-users] Doubts with Fairshare

2020-12-01 Thread Gestió Servidors
Hello, My SLURM cluster is applying "FairShare" with these values: PriorityType=priority/multifactor PriorityDecayHalfLife=7-0 PriorityCalcPeriod=5 PriorityUsageResetPeriod=QUARTERLY PriorityFavorSmall=NO PriorityMaxAge=7-0 PriorityWeightAge=1 PriorityWeightFairshare=100 PriorityWeightJobS

[slurm-users] "slurmd" daemon doesn't appears in systemd tree

2020-11-24 Thread Gestió Servidors
Hello, I would like to know if it is normal not to see the "slurmd" daemon in the systemd services tree. I have run "systemd-analyze plot > /tmp/plot.txt" and then searched for "slurmd" in that file, but no match is found. I mention it because I would like to know whether it could be a SLURM problem or a systemd

[slurm-users] Node random selection

2020-10-30 Thread Gestió Servidors
Hello, My students' cluster has 12 computers that act as "execution nodes". I have configured a partition where these 12 computers are defined. When someone submits a job that requires only one computer, if all 12 computers are available, the job always runs on the first computer defined in slurm.conf.

Re: [slurm-users] Slurmctld and log file

2020-09-09 Thread Gestió Servidors
Hello, This seems to imply you had some changes in your slurm.conf I'm presuming you are running Centos 7 or such. Do you see anything when you do 'journalctl -u slurmctld' I'm wondering if you were only logging to the journal and then added the bits to also/instead log to a separate file. I d

Re: [slurm-users] Slurmctld and log file

2020-09-08 Thread Gestió Servidors
Hello, My slurm logrotate file looks like this: > /var/log/slurm/*.log { > weekly > compress > missingok > nocopytruncate > nocreate > nodelaycompress > nomail > notifempty > noolddir > rotate 5 > sharedscripts > size=5M > create

[slurm-users] Slurmctld and log file

2020-09-08 Thread Gestió Servidors
Hello, I don't know why, but my SLURM server (that is running fine) has its slurmdctl.log file with size 0 bytes... so... where is it writing logs? It seems that the log file has had 0 bytes since the logrotate run early this morning. My logrotate SLURM conf is this: [root@server logrotate.d]# c

[slurm-users] Submitting jobs with constraint option

2020-09-03 Thread Gestió Servidors
Hello, I would like to apply some constraint options to my nodes. For example, InfiniBand available, processor model, etc., but I don't know where I need to specify that information. I know that in "sbatch" I can request those details with "--constraint=" but I suppose that I need to define t

Re: [slurm-users] Reset Fair-share tree account values

2020-07-17 Thread Gestió Servidors
Hi, I think this answer solves my problem: `sacctmgr` can be used to reset the accrued RawUsage value. Example usage: # sacctmgr modify user where Account= set RawUsage=0 Review the `sacctmgr` documentation for more details: https://slurm.schedmd.com/sacctmgr.html Best, Sebastian Thanks!!

[slurm-users] Reset Fair-share tree account values

2020-07-16 Thread Gestió Servidors
Hello, I will try to explain a scenario that occurs in my SLURM cluster. A significant number of users (accounts) belong to students of a certain subject. That subject lasts 6 months. When the subject ends, I "reset" user folders, clean all data, reset passwords and, in the next academic year, I o

Re: [slurm-users] Module "pam_slurm_adopt"

2020-07-01 Thread Gestió Servidors
Hi, My system runs SLURM 19.05.4 on CentOS Linux 7.7.1908. I have always installed SLURM from source code because the destination folder is an NFS folder shared between all nodes. I know I could install an RPM in a custom destination, but... installing from source code allows me configur

[slurm-users] Module "pam_slurm_adopt"

2020-07-01 Thread Gestió Servidors
Hello, I want to restrict users' SSH connections to compute nodes. I have read at https://slurm.schedmd.com/pam_slurm_adopt.html that "pam_slurm_adopt" allows an SSH connection if and only if that user has a job (or more than one) running on that node. However, my SLURM system (19.05.4) hasn

Re: [slurm-users] fail job

2020-06-30 Thread Gestió Servidors
Can you also post the slurmdctl.conf log file from the server (controller)?

[slurm-users] Tool-wrapper "sinteractive"

2020-06-25 Thread Gestió Servidors
Hello, A user of my cluster needs the "sinteractive" tool (or wrapper) but this tool is not installed in my SLURM. I have checked the source files and tree and it doesn't appear. How can I install (and use) this tool-wrapper? Thanks.

[slurm-users] MaxJobs not working

2020-05-18 Thread Gestió Servidors
Hi, Some minutes ago, I applied "MaxJobs=3" to a user. After that, when I ran "sacctmgr -s show user MYUSER format=account,user,maxjobs", the system showed a "3" in the maxjobs column. However, I have now run "squeue" and I'm seeing 4 jobs (from that user) in "running" state... Shouldn't it

[slurm-users] Show "maxjobs"

2020-05-18 Thread Gestió Servidors
Hi, I have applied "maxjobs" in accounting only for one user (not the account), so the other users in the same account have "infinite" maxjobs, but that user has 3 (the number I have configured). If I run "sacctmgr -s show user MYUSER format=user,maxjobs" I can see that 3 but how could I run "sacctmgr

[slurm-users] How to get command from a finished job

2020-04-30 Thread Gestió Servidors
Hello, I would like to know if there is any way to get the same information I can get from a running or pending job in the queue with "scontrol show jobid=" when the job has finished. When it has finished, "scontrol show jobid=" doesn't work and "sacct -j jobid" doesn't show all the

[slurm-users] Show detailed information from a finished job

2020-04-23 Thread Gestió Servidors
Hello, When a job is "pending" or "running", with "scontrol show jobid=#jobnumber" I can get some useful information, but when the job has finished, that command doesn't return anything. For example, if I run a "sacct" and I see that some jobs have finished with state "FAILED", how can I get d

Re: [slurm-users] Normal user cancelling a job

2020-03-17 Thread Gestió Servidors
With this command can I add a "normal" user to SLURM with "scancel" privileges over jobs in the same group? sacctmgr add coordinator account= names= At official SLURM documentation (https://slurm.schedmd.com/sacctmgr.html), I have read this: ENTITIES account A bank acc

Re: [slurm-users] Normal user cancelling a job

2020-03-16 Thread Gestió Servidors
and how can I add "Account Administrators" ? in the accounting database? or in a configuration file?

Re: [slurm-users] Job not cancelled after "TimeLimit" supered

2020-03-10 Thread Gestió Servidors
Hello, I have checked my configuration with "scontrol show config" and these are the values of those three parameters: AccountingStorageEnforce = none EnforcePartLimits = NO OverTimeLimit = 500 min ...so now I understand why my job hasn't been cancelled after 8 hours... because th
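The arithmetic behind that conclusion can be written out as a short Python sketch. OverTimeLimit is the number of minutes a job may exceed its time limit before being cancelled, so the cutoff is the limit plus that margin. The 8-hour job limit used below is an assumption inferred from the (truncated) post, and the function names are hypothetical.

```python
def latest_run_minutes(time_limit_min: int, over_time_limit_min: int) -> int:
    """Minutes a job may run before it is cancelled for exceeding its limit."""
    return time_limit_min + over_time_limit_min


def as_hours_minutes(total_min: int) -> str:
    """Format a number of minutes as 'H h MM min'."""
    hours, minutes = divmod(total_min, 60)
    return f"{hours} h {minutes:02d} min"


if __name__ == "__main__":
    # Assumption: the job's time limit was 8 hours (480 minutes).
    cutoff = latest_run_minutes(time_limit_min=8 * 60, over_time_limit_min=500)
    print(cutoff, "minutes =", as_hours_minutes(cutoff))
```

Under that assumption the job would not be cancelled until 980 minutes, about 16 hours and 20 minutes after it started.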

[slurm-users] Only one socket for SLURM

2019-02-18 Thread Gestió Servidors
Hi, One node of my cluster has 2 CPU sockets (with two 32-core CPUs). Now, I would like to configure my SLURM to share only the CPUs of the first socket. I have configured slurm.conf in this way: NodeName=mynode CPUs=32 SocketsPerBoard=1 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=515703 TmpDisk=27000
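A quick way to sanity-check a node line like this one is to confirm that CPUs equals boards × sockets × cores × threads. The Python sketch below does only that, using just the topology fields visible in the post. It says nothing about whether SLURM will actually confine jobs to the first socket, and the helper names are hypothetical.

```python
def parse_node_line(line: str) -> dict[str, str]:
    """Split a slurm.conf node definition into its key=value fields."""
    return dict(field.split("=", 1) for field in line.split())


def expected_cpus(node: dict[str, str]) -> int:
    """Logical CPU count implied by the topology fields (missing ones count as 1)."""
    boards = int(node.get("Boards", "1"))
    sockets = int(node.get("SocketsPerBoard", node.get("Sockets", "1")))
    cores = int(node.get("CoresPerSocket", "1"))
    threads = int(node.get("ThreadsPerCore", "1"))
    return boards * sockets * cores * threads


if __name__ == "__main__":
    line = "NodeName=mynode CPUs=32 SocketsPerBoard=1 CoresPerSocket=16 ThreadsPerCore=2"
    node = parse_node_line(line)
    print("declared:", node["CPUs"], "implied:", expected_cpus(node))
```

For the line in the post, the declared CPUs=32 agrees with 1 socket × 16 cores × 2 threads.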

[slurm-users] Federated Clusters

2019-02-12 Thread Gestió Servidors
Hi, I would like to know if the "federated clusters in SLURM" concept allows connecting two SLURM clusters that are completely separate (one controller for each cluster, only sharing users via NFS and NIS). Thanks.