[slurm-users] Re: Slurm fails before nvidia-smi command

2024-07-29 Thread Sarlo, Jeffrey S via slurm-users
nvidia-persistenced is installed along with the NVIDIA driver. Setting it to start at boot time helps slurmd find the GPUs when it starts. This is one web page with some information about it: https://download.nvidia.com/XFree86/Linux-x86_64/396.
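A minimal sketch, assuming a systemd-based node where the driver package installed a unit named nvidia-persistenced.service. Besides enabling the daemon at boot, a drop-in can ask systemd to start slurmd only after it. The drop-in file name nvidia.conf and the UNIT_DIR override are hypothetical choices made for this example.

```shell
#!/bin/sh
# Sketch: start slurmd only after nvidia-persistenced at boot.
# Assumes systemd and a unit named nvidia-persistenced.service.
# UNIT_DIR is a hypothetical override so the drop-in can be tried elsewhere.
write_slurmd_dropin() {
    unit_dir=${UNIT_DIR:-/etc/systemd/system}
    mkdir -p "$unit_dir/slurmd.service.d" || return 1
    # Wants= pulls the daemon in; After= orders slurmd behind it.
    cat > "$unit_dir/slurmd.service.d/nvidia.conf" <<'EOF'
[Unit]
Wants=nvidia-persistenced.service
After=nvidia-persistenced.service
EOF
}

# On a real node, as root:
#   systemctl enable nvidia-persistenced
#   write_slurmd_dropin && systemctl daemon-reload
```

Using a drop-in directory leaves the packaged slurmd.service file untouched, so a Slurm upgrade does not overwrite the ordering.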

Re: [slurm-users] Job cancelled into the future

2022-12-20 Thread Sarlo, Jeffrey S
Do they show up as runaway jobs?
sacctmgr show runawayjobs
If they do, it should give you the option to fix them. Jeff (in reply to Reed Dier, Tuesday, December 20, 2022 9:54 AM)

Re: [slurm-users] Prolog and job_submit

2022-10-29 Thread Sarlo, Jeffrey S
Not sure if this will help, but this page lists which user executes each of the scripts: https://slurm.schedmd.com/prolog_epilog.html Maybe the variable isn't set for the user executing the prolog/epilog/taskprolog. Jeff (in reply to Davide DelVento)
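One way to test that guess directly is to have the script log which user it runs as and whether the variable is visible to it. This is a debugging sketch only; the function name, the LOG_FILE variable and its default path are hypothetical.

```shell
#!/bin/sh
# Debugging sketch for a prolog, epilog, or taskprolog: record which user the
# script runs as and whether a given variable is visible to it.
# LOG_FILE and its default path are hypothetical.
log_script_context() {
    var_name=$1
    log_file=${LOG_FILE:-/tmp/slurm-script-context.log}
    # Indirect lookup of the variable named in $1 (POSIX sh has no ${!name}).
    eval "var_value=\${$var_name-}"
    if [ -n "$var_value" ]; then
        state="set to '$var_value'"
    else
        state="unset or empty"
    fi
    echo "user=$(id -un) job=${SLURM_JOB_ID:-none} $var_name is $state" >> "$log_file"
}

# Example call near the top of the script (variable name is hypothetical):
#   log_script_context MY_SITE_VAR
```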

Re: [slurm-users] Epilog script does not execute

2022-07-18 Thread Sarlo, Jeffrey S
It could be because the epilog script doesn't have a PATH set by default, for security reasons, so it may not be finding the echo or chmod commands (see https://slurm.schedmd.com/prolog_epilog.html). Do you have /var/slurm/etc/epilog-test on your nodes? Jeff
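A minimal sketch of an epilog that avoids the PATH problem by setting PATH itself. Calling every command by its absolute path works equally well. The output file and the EPILOG_TEST_FILE override are hypothetical; the thread's script wrote to /var/slurm/etc/epilog-test.

```shell
#!/bin/sh
# Sketch of an epilog that does not depend on an inherited PATH.
# Slurm runs the epilog without a search path, so set PATH explicitly
# (or call every command by its absolute path).
PATH=/usr/sbin:/usr/bin:/sbin:/bin
export PATH

epilog_main() {
    # Hypothetical output file; the thread used /var/slurm/etc/epilog-test.
    out=${EPILOG_TEST_FILE:-/tmp/epilog-test}
    echo "epilog ran for job ${SLURM_JOB_ID:-unknown}" > "$out" || return 1
    chmod 644 "$out"
}

epilog_main
```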

Re: [slurm-users] How to Make AvailableFeatures Persist after Slurmctld Restart

2022-06-02 Thread Sarlo, Jeffrey S
In slurm.conf, we just add the features to the node description. Is that what you were looking for?
NodeName=compute-4-4 ... Weight=15 Feature=gen10
Jeff, UH IT - HPC (in reply to Mike Hanby, Thursday, June 2, 2022 2:06 PM)
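After the feature is added to the node line in slurm.conf and slurmctld is restarted, you can confirm that the node still advertises it. This sketch assumes a Slurm version whose "scontrol show node" output includes an AvailableFeatures= field, which is the field named in this thread. The function name is hypothetical.

```shell
#!/bin/sh
# Sketch: show which features a node advertises, e.g. after restarting
# slurmctld, to confirm the Feature= entry in slurm.conf was picked up.
# Reads "scontrol show node <name>" output on standard input.
available_features() {
    grep -o 'AvailableFeatures=[^ ]*' | cut -d= -f2
}

# Usage:
#   scontrol show node compute-4-4 | available_features
```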

Re: [slurm-users] Node is not allocating all CPUs

2022-04-06 Thread Sarlo, Jeffrey S
Are the jobs being assigned memory amounts that would allow only 16 processors to be used when they run on the node? Jeff (in reply to David S. Guertin, Wednesday, April 6, 2022 9:21 AM)
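Here is the memory question worked through as arithmetic, using hypothetical numbers and assuming memory is tracked as a consumable resource (for example SelectTypeParameters=CR_Core_Memory). Take a 32-CPU node with RealMemory=64000 (MB) where each allocated CPU is charged 4000 MB, whether through DefMemPerCPU or a job's --mem-per-cpu. That node can hand out at most 64000 / 4000 = 16 CPUs before its memory is fully allocated.

```shell
#!/bin/sh
# Sketch: how many CPUs on a node can actually be allocated when memory,
# not cores, is the limiting resource. All numbers are hypothetical.
usable_cpus() {
    cpus=$1          # CPUs configured on the node
    real_mem=$2      # RealMemory for the node, in MB
    mem_per_cpu=$3   # memory charged per allocated CPU, in MB
    by_mem=$(( real_mem / mem_per_cpu ))
    # The smaller of the two limits wins.
    if [ "$by_mem" -lt "$cpus" ]; then echo "$by_mem"; else echo "$cpus"; fi
}

usable_cpus 32 64000 4000   # prints 16
```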

Re: [slurm-users] Changing DefaultAccount for user

2021-11-23 Thread Sarlo, Jeffrey S
The only way I found was to add the user to the new account and then change their default account to it:
sacctmgr add user example account=groupb
sacctmgr modify user where user=example set defaultaccount=groupb
Then you can remove them from the original default account if you want.
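The two sacctmgr commands can be wrapped in a small function so they run in the right order. The -i (--immediate) flag commits each change without the confirmation prompt. The SACCTMGR variable is a hypothetical convenience, not a Slurm feature: pointing it at echo gives a dry run. Removing the old association is left out here.

```shell
#!/bin/sh
# Sketch: move a user's default account in the order described above.
# SACCTMGR is a hypothetical override (e.g. SACCTMGR=echo for a dry run).
change_default_account() {
    user=$1 new_account=$2
    sacctmgr_cmd=${SACCTMGR:-sacctmgr}
    # 1. Associate the user with the new account first.
    $sacctmgr_cmd -i add user "$user" account="$new_account" || return 1
    # 2. Then make that account the user's default.
    $sacctmgr_cmd -i modify user where user="$user" set defaultaccount="$new_account"
}

# Usage:
#   change_default_account example groupb
```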

Re: [slurm-users] fix missing accounting entries

2021-03-01 Thread Sarlo, Jeffrey S
Were you thinking of this?
* Report current jobs that have been orphaned on the local cluster and are now runaway:
sacctmgr show RunawayJobs
(In reply to Brian Andrus, Monday, March 1, 2021 11:14 AM)

Re: [slurm-users] prolog not passing env var to job

2021-02-12 Thread Sarlo, Jeffrey S
In our taskprolog file we have something like:
#!/bin/sh
echo "export SCRATCHDIR=/scratch/${SLURM_JOBID}"
(In reply to Herc Silverstein, Friday, February 12, 2021 3:12 PM)
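A slightly fuller sketch of the same taskprolog idea. Slurm reads the TaskProlog's standard output: "export NAME=value" lines are added to the task's environment, and "print ..." lines are written to the job's output. The /scratch layout follows the thread. Creating the directory is assumed to happen elsewhere, such as in a root-run Prolog. The function name is hypothetical.

```shell
#!/bin/sh
# Sketch of a TaskProlog. Slurm reads the script's standard output:
# "export NAME=value" lines are added to the task's environment and
# "print ..." lines are written to the job's standard output.
# Creating /scratch/<jobid> is assumed to happen elsewhere, since
# TaskProlog runs as the job's user.
task_prolog() {
    job_id=${SLURM_JOB_ID:-$SLURM_JOBID}
    echo "export SCRATCHDIR=/scratch/${job_id}"
    echo "print Scratch directory: /scratch/${job_id}"
}

task_prolog
```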

Re: [slurm-users] changes in slurm.

2020-07-10 Thread Sarlo, Jeffrey S
If you run slurmd -C on the compute node, it should tell you what Slurm thinks the RealMemory value is. Jeff (in reply to navin srivastava, Friday, July 10, 2020 6:24 AM)
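slurmd -C prints the node's detected hardware as a NodeName line in slurm.conf syntax, with RealMemory in MB. This sketch pulls out that one value so it can be compared with the RealMemory configured for the node. The function name is hypothetical.

```shell
#!/bin/sh
# Sketch: extract RealMemory (in MB) from "slurmd -C" output so it can be
# compared with the RealMemory configured for the node in slurm.conf.
real_memory_from_slurmd() {
    sed -n 's/.*RealMemory=\([0-9][0-9]*\).*/\1/p'
}

# Usage on a compute node:
#   detected=$(slurmd -C | real_memory_from_slurmd)
#   echo "slurmd detects RealMemory=$detected MB"
```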

Re: [slurm-users] Slurm not detecting gpu after swapping out gpu

2020-04-27 Thread Sarlo, Jeffrey S
How do you have fabricnode2 defined in your gres.conf and slurm.conf files? Since the GPU type changed, maybe its definition needs to be updated too. Jeff (in reply to Dean Schulze, Monday, April 27, 2020 11:47 AM)
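To compare the two definitions quickly after a GPU swap, this sketch prints the Type= from gres.conf and the Gres= from slurm.conf for one node. A type name left over from the old card then stands out. It assumes both files name the node on a NodeName= line; the function name and the file paths in the usage line are assumptions.

```shell
#!/bin/sh
# Sketch: print the GPU definitions for one node from gres.conf and
# slurm.conf side by side, so a stale type name from the old card is
# easy to spot. Assumes both files name the node with NodeName=.
show_gpu_defs() {
    node=$1 gres_conf=$2 slurm_conf=$3
    echo "gres.conf:"
    grep -E "NodeName=$node( |\$)" "$gres_conf" | grep -o 'Type=[^ ]*'
    echo "slurm.conf:"
    grep -E "NodeName=$node( |\$)" "$slurm_conf" | grep -o 'Gres=[^ ]*'
}

# Usage (paths are assumptions, adjust for your site):
#   show_gpu_defs fabricnode2 /etc/slurm/gres.conf /etc/slurm/slurm.conf
```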

Re: [slurm-users] Slurm configuration, Weight Parameter

2019-12-05 Thread Sarlo, Jeffrey S
We use both node weights and priority/multifactor. Jeff
Quoted from Sistemas NLHPC (Thursday, December 05, 2019 12:01 PM, to Jeffrey S Sarlo and the Slurm User Community List): Thanks Jeff! We upgrade slurm to 18.08.4

[slurm-users] Getting information about AssocGrpCPUMinutesLimit for a job

2019-08-07 Thread Sarlo, Jeffrey S
We had a job queued waiting for resources, and when we raised the debug level we were able to get the following in the slurmctld.log file:
[2019-08-02T10:03:47.347] debug2: JobId=804633 being held, the job is at or exceeds assoc 50(jeff/(null)/(null)) group max tres(cpu) minutes of 3000 of
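A small sketch of the same procedure. First raise slurmctld's log level with scontrol setdebug, then pull out the lines explaining why a particular job is held. The default log path is a hypothetical placeholder; use the SlurmctldLogFile from your slurm.conf. Restore your usual level afterwards.

```shell
#!/bin/sh
# Sketch: after raising slurmctld's log level, pull out the lines that say
# why a given job is being held. The default log path is hypothetical;
# use the SlurmctldLogFile set in slurm.conf.
#   Raise the level:  scontrol setdebug debug2
#   Restore it later: scontrol setdebug info   (or your usual SlurmctldDebug level)
held_reason() {
    job_id=$1
    log_file=${2:-/var/log/slurmctld.log}
    grep "JobId=$job_id being held" "$log_file"
}

# Usage:
#   held_reason 804633 /var/log/slurm/slurmctld.log
```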

Re: [slurm-users] Slurm node weights

2019-07-25 Thread Sarlo, Jeffrey S
Quoted from Sarlo, Jeffrey S (25 July 2019 13:04): This is the fix if you want to modify the code and rebuild: https://github.com/SchedMD/slurm/commit/f66a2a3e2064

Re: [slurm-users] Slurm node weights

2019-07-25 Thread Sarlo, Jeffrey S
deploy that new version for quite a while. In the meantime, does anyone know if there is any fix or alternative strategy that might help us achieve the same result? Best regards, David (in reply to Sarlo, Jeffrey S, 25 July 2019 12:26)

Re: [slurm-users] Slurm node weights

2019-07-25 Thread Sarlo, Jeffrey S
Which version of Slurm are you running? I know some of the earlier 18.08 releases had a bug where node weights did not work. Jeff (in reply to David Baker, Thursday, July 25, 2019 6:09 AM)