[slurm-users] NVML autodetect "Failed to get supported memory frequencies" error

2021-03-04 Thread Joshua Baker-LePain
We're in the midst of transitioning our SGE cluster to slurm 20.02.6, running on up-to-date CentOS-7. We built RPMs from the standard tarball against CUDA 10.1. These RPMs worked just fine on our first GPU test node (with Tesla K80s) using "AutoDetect=nvml" in /etc/gres.conf. However, we
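
For reference, the basic NVML autodetection setup that message refers to looks something like this (node name and GPU count are placeholders, not taken from the thread):

    # /etc/gres.conf -- let slurmd discover GPUs via the NVIDIA management library
    AutoDetect=nvml

    # slurm.conf (only the GRES-related lines)
    GresTypes=gpu
    NodeName=gpunode01 Gres=gpu:2 State=UNKNOWN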

Re: [slurm-users] prolog not passing env var to job

2021-03-04 Thread Chin,David
My mistake - from slurm.conf(5): SrunProlog runs on the node where "srun" is executing, i.e. the login node, which explains why the directory is not being created on the compute node, while the echoes work. -- David Chin, PhD (he/him) Sr. SysAdmin, URCF, Drexel dw...@drexel.edu
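
A quick sketch of the distinction (parameter names are from slurm.conf(5); the script paths are placeholders):

    # slurm.conf
    # Runs where srun itself is invoked (e.g. the login node):
    SrunProlog=/etc/slurm/srun_prolog.sh
    # Runs on the allocated compute node(s):
    Prolog=/etc/slurm/prolog.sh
    # Runs per task on the compute node; lines it prints of the form
    # "export NAME=value" are injected into the task's environment:
    TaskProlog=/etc/slurm/task_prolog.sh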

Re: [slurm-users] [External] Re: exempting a node from Gres Autodetect

2021-03-04 Thread Paul Brunk
Hi all: Prentice wrote: > I don't see how that bug is related. That bug is about requiring the > libnvidia-ml.so library for an RPM that was built with NVML > Autodetect enabled. His problem is the opposite - he's already using > NVML autodetect, but wants to disable that feature on a single
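
For completeness, the usual shape of such an exemption looks something like the following (a sketch only; the node name and device path are hypothetical, and the per-node AutoDetect override is only honored on newer Slurm releases than the 20.02 discussed here):

    # /etc/gres.conf
    AutoDetect=nvml
    # Explicitly configured node that should not use NVML autodetection:
    NodeName=node042 AutoDetect=off Name=gpu File=/dev/nvidia0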

Re: [slurm-users] slurm and device cgroups

2021-03-04 Thread Ransom, Geoffrey M.
Well, reading the source it looks like xcgroup_set_params is just writing to the devices.allow and devices.deny files. I haven't yet found what cg->path is being set to, but presumably it is set to /sys/fs/cgroup/slurm/uid_##/job_#/step_0 or the equivalent for the job in question. I'm
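
Replicating that by hand under cgroup v1 would look roughly like the following (the cgroup path mirrors the one presumed above but under the devices controller mount, and the NVIDIA character-device major number 195 should be verified on the node):

    # Deny a job step access to /dev/nvidia1 (char device 195:1)
    echo "c 195:1 rwm" > /sys/fs/cgroup/devices/slurm/uid_1000/job_12345/step_0/devices.deny
    # Grant it back
    echo "c 195:1 rwm" > /sys/fs/cgroup/devices/slurm/uid_1000/job_12345/step_0/devices.allow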

[slurm-users] slurm and device cgroups

2021-03-04 Thread Ransom, Geoffrey M.
Hello, I am trying to debug an issue with EGL support (we updated the NVIDIA drivers, and now EGLGetDisplay and EGLQueryDevicesExt are failing if they can't access all /dev/nvidia# devices in slurm) and am wondering how slurm uses device cgroups so I can implement the same cgroup setup by hand and
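
For context, Slurm only applies device cgroups when they are enabled in cgroup.conf; the relevant lines look something like this (a sketch, not the poster's actual configuration):

    # /etc/slurm/cgroup.conf
    ConstrainDevices=yes
    # Devices listed in this file stay visible to every job; everything else
    # is limited to what the job's GRES allocation grants:
    AllowedDevicesFile=/etc/slurm/cgroup_allowed_devices_file.conf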

Re: [slurm-users] [External] Re: About sacct --format: how can I get info about the fields

2021-03-04 Thread Prentice Bisbal
They're also listed on the sacct online man page: https://slurm.schedmd.com/sacct.html Scroll down until you see the text box with white text on a black background - you can't miss it. Also, depending on how you're parsing the output, you might want to skip printing the headers, which
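
For example, to feed the output to a parser without the header block:

    # Pipe-delimited output, no header line
    sacct --noheader --parsable2 --format=JobID,JobName,State,Elapsed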

Re: [slurm-users] prolog not passing env var to job

2021-03-04 Thread Chin,David
Hi Brian: This works just as I expect for sbatch. The example srun execution I showed was a non-array job, so the first half of the "if []" statement holds. It is the second half, which deals with job arrays, that has the period. The value of TMP is correct, i.e. "/local/scratch/80472" And

Re: [slurm-users] prolog not passing env var to job

2021-03-04 Thread Chin,David
Hi, Brian: So, this is my SrunProlog script -- I want a job-specific tmp dir, which makes for easy cleanup at end of job:

    #!/bin/bash
    if [[ -z ${SLURM_ARRAY_JOB_ID+x} ]]
    then
        export TMP="/local/scratch/${SLURM_JOB_ID}"
        export TMPDIR="${TMP}"
        export LOCAL_TMPDIR="${TMP}"
        export
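
Filling in the branch discussed in the other reply in this thread, a minimal version of such a prolog might look like this (the array-job path with the period is inferred from that reply, not copied from the original script):

    #!/bin/bash
    # Per-job scratch directory; array tasks get jobid.taskid
    if [[ -z ${SLURM_ARRAY_JOB_ID+x} ]]; then
        export TMP="/local/scratch/${SLURM_JOB_ID}"
    else
        export TMP="/local/scratch/${SLURM_ARRAY_JOB_ID}.${SLURM_ARRAY_TASK_ID}"
    fi
    export TMPDIR="${TMP}"
    export LOCAL_TMPDIR="${TMP}"
    mkdir -p "${TMP}"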

Re: [slurm-users] prolog not passing env var to job

2021-03-04 Thread Brian Andrus
It seems to me that if you are using srun directly to get an interactive shell, you can just run the script once you get your shell. You can set the variables and then run srun. It automatically exports the environment. If you want to change a particular one (or more), use something like
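
Continuing that thought, overriding just one or two variables while keeping the rest of the environment is typically done with srun's --export option (the path below is a placeholder):

    # Keep the full environment, but override TMP for this interactive step
    srun --export=ALL,TMP=/local/scratch/mytest --pty bash -i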

[slurm-users] Set all Fairshares manually

2021-03-04 Thread Michael Müller
Dear Slurm users, we are running a cluster that has a flat account structure. All accounts have a monthly limit that can only change on the 1st of a month. Users assigned to the very same account shall not compete against each other (created with fairshare=parent) and their fairshare shall be
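
For reference, both explicit share values and the parent setting mentioned here are set through sacctmgr; the account and user names below are placeholders:

    # Give an account its own share value
    sacctmgr modify account where name=proj_a set fairshare=100
    # Let a user's jobs draw on the parent account's share rather than a personal one
    sacctmgr modify user where name=alice account=proj_a set fairshare=parent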