Re: [slurm-users] Jobs can grow in RAM usage surpassing MaxMemPerNode

2023-01-12 Thread Daniel Letai
Hello Cristóbal, I think you might have a slight misunderstanding of how Slurm works, which can cause this difference in expectation. The MaxMemPerNode is there to allow the scheduler to plan job placement according to resources. It does not
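The distinction drawn above, that MaxMemPerNode informs job placement but does not by itself cap a running job's memory, can be illustrated with a configuration sketch. The values below and the pairing with cgroup enforcement are illustrative assumptions, not settings taken from the thread:

```
# slurm.conf (sketch) — MaxMemPerNode only constrains scheduling:
# the controller will not place jobs that *request* more than this.
MaxMemPerNode=64000          # MB, illustrative value
TaskPlugin=task/cgroup       # needed so limits are enforced at runtime

# cgroup.conf (sketch) — actual runtime enforcement of a job's RAM:
ConstrainRAMSpace=yes
```

Without cgroup (or similar) enforcement, a job can grow past its requested memory at runtime even though the scheduler honored MaxMemPerNode at placement time.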

[slurm-users] Cannot enable Gang scheduling

2023-01-12 Thread Helder Daniel
Hi, I am trying to enable gang scheduling on a server with a 32-core CPU and 4 GPUs. However, with gang scheduling, the CPU jobs (or GPU jobs) are not being preempted after the time slice, which is set to 30 secs. Below is a snapshot of squeue. There are 3 jobs, each needing 32 cores. The
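Gang scheduling in Slurm typically requires the preemption mode, the time slice, and partition oversubscription to be set together; if OverSubscribe does not allow more than one job per resource, nothing rotates. A hedged sketch of the settings involved (option names are standard Slurm options, the values are assumptions matching the thread's 30-second slice):

```
# slurm.conf (sketch) — settings gang scheduling generally needs:
PreemptMode=SUSPEND,GANG
SchedulerTimeSlice=30                 # seconds between gang rotations
# OverSubscribe must permit >1 job per resource for time-slicing to occur:
PartitionName=batch Nodes=ALL OverSubscribe=FORCE:2 Default=YES State=UP
```

A common cause of "jobs never rotate" is a partition left at the default OverSubscribe=NO, which gives each job exclusive use of its cores so there is never a second job to swap in.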

[slurm-users] Regression from slurm-22.05.2 to slurm-22.05.7 when using "--gpus=N" option.

2023-01-12 Thread Rigoberto Corujo
Hello, I have a small 2-compute-node GPU cluster, where each node has 2 GPUs. $ sinfo -o "%20N  %10c  %10m  %25f  %30G" NODELIST  CPUS  MEMORY  AVAIL_FEATURES  GRES  o186i[126-127]  128  64000  (null)
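A minimal way to compare the `--gpus=N` behavior across the two Slurm versions is to submit the same request on each and inspect the resulting GRES allocation. The commands below are a hedged sketch, not taken from the thread; the `--wrap` payload is an arbitrary placeholder:

```
# Request 2 GPUs cluster-wide vs. per node, then compare allocations:
sbatch --gpus=2 --wrap="nvidia-smi -L"
sbatch --gpus-per-node=1 -N2 --wrap="nvidia-smi -L"
squeue -o "%i %P %T %b"    # %b shows each job's GRES allocation
```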

Re: [slurm-users] job_container/tmpfs and autofs

2023-01-12 Thread Gizo Nanava
Hello, another workaround could be to use the InitScript=/path/to/script.sh option of the plugin. For example, if the user's home directory is under autofs: script.sh: uid=$(squeue -h -O username -j $SLURM_JOB_ID) cd /home/$uid Best regards Gizo > Hi there, > we excitedly found the
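The two-line InitScript quoted above can be fleshed out into a complete script. This is a sketch under the assumption that the script is referenced via InitScript= in job_container.conf and that triggering autofs before the tmpfs namespace detaches is the goal; the /home path layout is an assumption:

```
#!/bin/bash
# /path/to/script.sh — referenced by InitScript= in job_container.conf.
# Look up the submitting user of this job, then cd into their home
# directory so autofs mounts it before the job's namespace is set up.
uid=$(squeue -h -O username -j "$SLURM_JOB_ID")
cd "/home/$uid" || exit 1
```

Quoting the expansions and failing explicitly on a missed mount keeps the plugin from proceeding with an unmounted home directory.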

Re: [slurm-users] job_container/tmpfs and autofs

2023-01-12 Thread Ole Holm Nielsen
Hi Magnus, We had the same challenge some time ago. A long description of solutions is in my Wiki page at https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#temporary-job-directories The issue may have been solved in https://bugs.schedmd.com/show_bug.cgi?id=12567 which will be

Re: [slurm-users] job_container/tmpfs and autofs

2023-01-12 Thread Ümit Seren
We had the same issue when we switched to the job_container plugin. We ended up running cvmfs_config probe as part of the health check tool so that the cvmfs repos stay mounted. However, after we switched on power saving we ran into some race conditions (a job landed on a node before the cvmfs was
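The probe-based workaround described above can be sketched as a health-check step. How exactly it is wired into a given health-check tool (e.g. NHC or Slurm's HealthCheckProgram) is an assumption here; `cvmfs_config probe` itself is the standard CVMFS command that mounts and verifies every configured repository:

```
#!/bin/bash
# Health-check fragment (sketch): force-mount all configured CVMFS
# repositories so they are present before jobs land on the node.
cvmfs_config probe || exit 1
```

Running this periodically keeps autofs-managed CVMFS mounts alive, though as the poster notes it does not by itself close the race with power-saving node resume.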

Re: [slurm-users] job_container/tmpfs and autofs

2023-01-12 Thread Bjørn-Helge Mevik
In my opinion, the problem is with autofs, not with tmpfs. Autofs simply doesn't work well when you are using detached filesystem namespaces and bind mounting. We ran into this problem years ago (with an in-house SPANK plugin doing more or less what tmpfs does), and ended up simply not using autofs. I