[slurm-users] Re: Limit GPU depending on type
Gestió Servidors via slurm-users wrote:
> What I want is that users could use all of them, but simultaneously a
> user could only use one of the RTX3080.

How about two partitions: one that contains only the RTX3080s, using a
QoS with MaxTRESPerUser=gres/gpu=1, and another one with all the other
GPUs, not having this QoS. Users then submit to both of these
partitions.
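For illustration, a minimal sketch of that setup; the node names, the
QoS name and the GPU layout are made up:

  # create the QoS that caps RTX3080 usage at one GPU per user
  sacctmgr add qos one3080
  sacctmgr modify qos one3080 set MaxTRESPerUser=gres/gpu=1

  # slurm.conf: the RTX3080 partition gets the QoS, the other does not
  PartitionName=rtx3080 Nodes=gpu[01-02] QOS=one3080
  PartitionName=gpus Nodes=gpu[03-08]

  # users then submit to both partitions at once:
  sbatch --partition=rtx3080,gpus --gres=gpu:1 job.sh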
[slurm-users] Avoiding fragmentation
Hi,

I'm trying to figure out how to deal with a mix of few- and many-CPU
jobs. By that I mean most jobs use 128 CPUs, but sometimes there are
jobs with only 16. As soon as a job with only 16 is running, the
scheduler splits the next 128-CPU jobs into 96+16 each, instead of
assigning a full 128-CPU node to them. Is there a way for the
administrator to make the scheduler prefer full nodes? The existence of
pack_serial_at_end makes me believe there is not, because that is
basically what I need, apart from my serial jobs using 16 CPUs instead
of 1.
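For reference, that parameter is a SchedulerParameters option in
slurm.conf, but it only covers single-CPU jobs, which is why it doesn't
solve my case:

  # slurm.conf: schedule serial (1-CPU) jobs onto the nodes at the end
  # of the available node list instead of spreading them
  SchedulerParameters=pack_serial_at_end

Gerhard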
[slurm-users] Re: Suggestions for Partition/QoS configuration
thomas.hartmann--- via slurm-users wrote:
> My idea was to basically have three partitions:
>
> 1. PartitionName=short MaxTime=04:00:00 State=UP Nodes=node[01-99]
>    PriorityTier=100
> 2. PartitionName=long_safe MaxTime=14-00:00:00 State=UP
>    Nodes=node[01-50] PriorityTier=100
> 3. PartitionName=long_preempt MaxTime=14-00:00:00 State=UP
>    Nodes=node[01-99] PriorityTier=40 PreemptMode=requeue

I don't know why you consider preemption: if the jobs are short, just
wait for them to finish. My first approach would be to have two
partitions, both of them containing all nodes, but with different QoSes
assigned to them, so you can limit the short jobs to a certain amount
of CPUs and also limit the long jobs to a certain amount of CPUs -
maybe 80% for each of them.
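A sketch of that idea; the QoS names and the CPU total are made up (say
the cluster has 12800 CPUs, so 80% is 10240):

  sacctmgr add qos short_80
  sacctmgr modify qos short_80 set GrpTRES=cpu=10240
  sacctmgr add qos long_80
  sacctmgr modify qos long_80 set GrpTRES=cpu=10240

  # slurm.conf: both partitions span all nodes, each capped by its QoS
  PartitionName=short MaxTime=04:00:00 Nodes=node[01-99] QOS=short_80
  PartitionName=long MaxTime=14-00:00:00 Nodes=node[01-99] QOS=long_80

Gerhard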
[slurm-users] Re: "Optimal" slurm configuration
Max Grönke via slurm-users wrote:
> (b) introduce a "small" partition for the <4h jobs with higher
> priority but we're unsure if this will block all the larger jobs to
> run...

Just limit the number of CPUs in that partition.
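For example, via a partition QoS (the name and the CPU number are made
up):

  sacctmgr add qos small_cap
  sacctmgr modify qos small_cap set GrpTRES=cpu=2048

  # slurm.conf:
  PartitionName=small Nodes=node[01-99] MaxTime=04:00:00 PriorityTier=100 QOS=small_cap

Gerhard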
[slurm-users] Memory used per node
Hello,

I'm wondering if there's a way to tell how much memory my job is using
per node. I'm doing:

  #SBATCH -n 256
  srun solver inputfile

When I run sacct -o maxvmsize, the result apparently is the maximum VSZ
of the largest solver process, not the maximum of the sum of them all
(unlike when calling mpirun instead). When I run sstat -o
TresUsageInMax, I get the memory summed up over all nodes being used.
Can I get the maximum VSZ per node?

Gerhard
Re: [slurm-users] How to run one maintenance job on each node in the cluster
Jeffrey Tunison wrote:
> Is there a straightforward way to create a batch job that runs once
> on every node in the cluster?

A wrapper around reboot, configured as RebootProgram in slurm.conf?
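Roughly (the wrapper path and the node list are placeholders):

  # slurm.conf: run the maintenance wrapper when a node is told to reboot
  RebootProgram=/usr/local/sbin/maintenance-and-reboot

  # then, instead of one batch job per node:
  scontrol reboot ASAP nextstate=RESUME node[01-99]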
Re: [slurm-users] Reproducible irreproducible problem (timeout?)
Laurence Marks wrote:
> After some (irreproducible) time, often one of the three slow tasks
> hangs. A symptom is that if I try and ssh into the main node of the
> subtask (which is running 128 mpi on the 4 nodes) I get
> "Authentication failed".

How about asking an admin to check why it hangs?
Re: [slurm-users] Slurm versions 23.02.6 and 22.05.10 are now available (CVE-2023-41914)
Tim Wickberg wrote:
> A number of race conditions have been identified within the
> slurmd/slurmstepd processes that can lead to the user taking ownership
> of an arbitrary file on the system.

Is it any different from CVE-2023-41915 in PMIx, or does it just have
an additional number for the same issue? Or did someone mistype the
number? I couldn't find any information on CVE-2023-41914.

Gerhard
Re: [slurm-users] Aborting a job from inside the prolog
Alexander Grund wrote:
> Although it may be better to not drain it, I'm a bit nervous with
> "exit 0", as it is very important that the job does not
> start/continue, i.e. the user code (sbatch script/srun) is never
> executed in that case. So I want to be sure that an `scancel` on the
> job in its prolog always prevents the job from running.

Just return the exit code of scancel, then. If it failed, the prolog
failed and the job gets re-queued. If it didn't, the job was cancelled.
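A prolog fragment along those lines (the check itself is a
placeholder):

  #!/bin/sh
  # if the job must not run, cancel it and propagate scancel's exit
  # code: scancel fails => prolog fails => job gets re-queued;
  # scancel succeeds => job is cancelled either way
  if ! /usr/local/sbin/job_is_allowed "$SLURM_JOB_ID"; then
      scancel "$SLURM_JOB_ID"
      exit $?
  fi
  exit 0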
Re: [slurm-users] Aborting a job from inside the prolog
Alexander Grund wrote:
> Our first approach with `scancel $SLURM_JOB_ID; exit 1` doesn't seem
> to work, as the (sbatch) job still gets re-queued.

Try exiting with 0, because it's not your prolog that failed.
Re: [slurm-users] Does Slurm have any equivalent to LSF elim for generating dynamic node resources
Amir Ben Avi wrote:
> I have looked at the Slurm documentation, but didn't find any way to
> create resources dynamically (in a script) on the node level.

Well, basically you could do something like

  scontrol update nodename=$HOSTNAME Gres=myres:367

What you don't have is decaying resource reservations to compensate
for the delay between your elim replacement and job starts.
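So a crude elim replacement could be a small script run periodically on
each node ("myres" and the probe are placeholders; the Gres must
already be defined for the node in slurm.conf):

  #!/bin/sh
  # recompute the current amount of the resource and push it upstream
  COUNT=$(/usr/local/sbin/probe_myres)
  scontrol update NodeName=$(hostname -s) Gres=myres:$COUNT

Gerhard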
Re: [slurm-users] Rolling reboot with at most N machines down simultaneously?
Phil Chiu wrote:
> - Individual slurm jobs which reboot nodes - With a for loop, I could
>   submit a reboot job for each node. But I'm not sure how to limit
>   this so at most N jobs are running simultaneously.

With a fake license called "reboot"?
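That is, for N=4 (license name and job script made up):

  # slurm.conf: only 4 reboot "licenses" exist cluster-wide
  Licenses=reboot:4

  # one reboot job per node; at most 4 of them run at a time
  for n in $(sinfo -h -N -o %N | sort -u); do
      sbatch -w "$n" -L reboot:1 reboot_job.sh
  done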
Re: [slurm-users] Jobs fail on specific nodes.
Roger Mason wrote:
> I would appreciate any suggestions on what might be causing this
> problem or what I can do to diagnose it.

Run "getent hosts node012" on all hosts to see which one can't resolve
it.
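E.g., assuming passwordless ssh to all nodes:

  for h in $(sinfo -h -N -o %N | sort -u); do
      echo "$h:"; ssh "$h" getent hosts node012
  done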
Re: [slurm-users] Sharing a GPU
Eric F. Alemany wrote:
> Another solution would be the NVIDIA vGPU
> (Virtual GPU manager software).
> You can share GPU among VMs.

Can you really *share* one, not just delegate one GPU to one VM?
Re: [slurm-users] Limit partition to 1 job at a time
Russell Jones wrote:
> I suppose I am confused about how GrpJobs works. The manual says:
>
>   The total number of jobs able to run at any given time from an
>   association and its children QOS
>
> It is my understanding that an association is cluster + account +
> user. Would this not just limit it to 1 job per user in the
> partition, not 1 job at a time total in the partition?

I'm using GrpTRES to limit the number of cores per partition - and
that's not per user. So I'm assuming that MaxJobs is per user while
GrpJobs is not.
Re: [slurm-users] Limit partition to 1 job at a time
Russell Jones wrote:
> I am struggling to figure out how to do this. Any tips?

Create a QoS with GrpJobs=1 and assign it to the partition?
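Something like (the QoS and partition names are made up):

  sacctmgr add qos onejob
  sacctmgr modify qos onejob set GrpJobs=1

  # slurm.conf: attach the QoS to the partition
  PartitionName=single Nodes=node[01-04] QOS=onejob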
Re: [slurm-users] How to checkout a slurm node?
Joe Teumer wrote:
> However, if the user needs to reboot the node, set BIOS settings, etc
> then `salloc` automatically terminates the allocation when the new
> shell is

What kind of BIOS settings would a user need to change?
Re: [slurm-users] how to check what slurm is doing when job pending with reason=none?
taleinterve...@sjtu.edu.cn wrote:
> But after submit, this job still stays in the PENDING state for about
> 30-60s, and during the pending time sacct shows the REASON is "None".

It's the default sched_interval=60 in your slurm.conf.
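If the delay matters, you can lower it, e.g.:

  # slurm.conf: run the main scheduling loop every 15 seconds instead
  # of the default 60
  SchedulerParameters=sched_interval=15

Gerhard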
[slurm-users] Draining hosts because of failing jobs
Hello,

how do you implement something like "drain host after 10 consecutive
failed jobs"? Unlike a host check script, which checks for known
errors, I'd like to stop killing jobs just because one node is faulty.
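The only approach I can think of is counting failures in an
EpilogSlurmctld script, roughly like this (the threshold, the spool
path and the use of SLURM_JOB_EXIT_CODE are my assumptions):

  #!/bin/sh
  # per-node counter of consecutive failed jobs; drain at 10
  DIR=/var/spool/slurm/failcount
  mkdir -p "$DIR"
  for n in $(scontrol show hostnames "$SLURM_JOB_NODELIST"); do
      if [ "$SLURM_JOB_EXIT_CODE" = "0" ]; then
          rm -f "$DIR/$n"            # a success resets the counter
      else
          c=$(( $(cat "$DIR/$n" 2>/dev/null || echo 0) + 1 ))
          echo "$c" > "$DIR/$n"
          [ "$c" -ge 10 ] && scontrol update NodeName="$n" \
              State=DRAIN Reason="10 consecutive failed jobs"
      fi
  done

Gerhard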
[slurm-users] scancel the solver, not MPI
Hi,

I'm wondering if it's possible to gracefully terminate a solver that
is using MPI. If srun starts the MPI processes for me, can it tell the
solver to terminate and then wait n seconds before it tells MPI to
terminate? Or is the only way of handling this using scancel -b and
trapping the signal?
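The trap variant would look roughly like this (a solver polling for a
stop file is my assumption):

  #!/bin/sh
  #SBATCH -n 128
  graceful() {
      touch STOPFILE    # the solver is assumed to poll for this file
      sleep 60          # grace period before everything is torn down
  }
  trap graceful TERM    # sent by: scancel -b --signal=TERM <jobid>
  srun solver inputfile &
  wait                  # interrupted when the trap fires ...
  wait                  # ... so collect srun after the handler ran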
Re: [slurm-users] Limit nodes of a partition without managing users
Brian Andrus wrote:
> Most likely, but the specific approach depends on how you define what
> you want.

My idea was "a high prio job is next unless there are too many of
them".

> For example, what if there are no jobs in high pri queue but many in
> low? Should all the low ones run?

Yes.

> What should happen if they get started and use all the nodes and a
> high-pri request comes in (preemption policy)?

No preemption.

> What about the inverse of that?

The inverse of what? All nodes being used by high prio jobs? That's
exactly what I want to avoid.

> What if you get a steady stream of high-pri jobs? How long should
> low-pri wait before being allowed to run?

As long as it takes. Since I'm trying to avoid high prio jobs consuming
all nodes, it won't take forever. :-)

> Does it matter if it is all the same user?

No.

> You can handle much of that type of interaction with job priorities
> and a single queue. As you can see, the devil is in the details on
> how to define/get what you want.

How do you make sure the single partition doesn't run only high prio
jobs if there's a sufficient amount of those?

Gerhard
Re: [slurm-users] [External] Limit nodes of a partition without managing users
Prentice Bisbal wrote:
>> I'm wondering if it's possible to have slurm 19 run two partitions
>> (low and high prio) that share all the nodes and limit the high prio
>> partition in number of nodes used simultaneously without requiring
>> to manage the users in the database.
>
> Yes, you can do this using Slurm's QOS facility

I don't think so. The documentation says that setting
"AccountingStorageEnforce=qos" is equal to
"AccountingStorageEnforce=qos,associations", and "associations - This
will prevent users from running jobs if their association is not in
the database."

Gerhard
[slurm-users] Limit nodes of a partition without managing users
Hello,

I'm wondering if it's possible to have Slurm 19 run two partitions
(low and high prio) that share all the nodes and limit the high prio
partition in the number of nodes used simultaneously, without
requiring me to manage the users in the database. Any ideas?

Regards,
Gerhard
Re: [slurm-users] Debugging communication problems
Gerhard Strangar wrote:
> I'm experiencing a connectivity problem and I'm out of ideas why this
> is happening. I'm running slurmctld on a multihomed host.
>
> (10.9.8.0/8) - master - (10.11.12.0/8)
> There is no routing between these two subnets.

My topology.conf contained a loop, which resulted in incorrect message
forwarding.
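In case someone runs into the same thing: a loop means switches
referencing each other, e.g. (switch names made up):

  # topology.conf - s1 and s2 list each other, forming a loop
  SwitchName=s1 Switches=s2
  SwitchName=s2 Switches=s1

Gerhard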
[slurm-users] Debugging communication problems
Hi,

I'm experiencing a connectivity problem and I'm out of ideas why this
is happening. I'm running slurmctld on a multihomed host.

(10.9.8.0/8) - master - (10.11.12.0/8)

There is no routing between these two subnets. So far, all slurmds
resided in the first subnet and worked fine. I added some in the second
subnet and they keep changing into the DOWN state. I checked the "last
slurmd control message" and sometimes it's overdue by 20 minutes and
more, with a configured slurmd timeout of 5 minutes. I did a tcpdump
and it showed that slurmctld isn't even trying to connect to the
slurmds at that time. I haven't found any packet loss yet, the
redundant DNS servers are both resolving the host names properly at
that time, and slurmctld just states a communication error for the ping
request while the slurmds are running and all hosts are idle. What
reasons can there be for not contacting the slurmds? Or is it more
likely that the reply gets lost on its way?

Gerhard