[slurm-users] Re: Slurm Cleaning Up $XDG_RUNTIME_DIR Before It Should?

2024-05-15 Thread Ward Poelmans via slurm-users
Hi, This is systemd, not slurm. We've also seen it being created and removed. As far as I understood, it's something about the session that systemd cleans up. We've worked around it by adding this to the prolog: MY_XDG_RUNTIME_DIR=/dev/shm/${USER} mkdir -p $MY_XDG_RUNTIME_DIR echo "export
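
The snippet is cut off above; a minimal sketch of the complete workaround, assuming it lives in the TaskProlog, whose stdout lines of the form "export NAME=value" slurmd injects into the task environment (the exported variable name is an assumption):

    #!/bin/bash
    # create a per-user runtime dir that systemd will not clean up
    MY_XDG_RUNTIME_DIR=/dev/shm/${USER}
    mkdir -p "$MY_XDG_RUNTIME_DIR"
    # picked up by slurmd and set in the job's environment
    echo "export XDG_RUNTIME_DIR=$MY_XDG_RUNTIME_DIR"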

[slurm-users] Re: canonical way to run longer shell/bash interactive job (instead of srun inside of screen/tmux at front-end)?

2024-02-26 Thread Ward Poelmans via slurm-users
Hi, On 26/02/2024 09:27, Josef Dvoracek via slurm-users wrote: Is anybody using something more advanced and still understandable by a casual user of HPC? I'm not sure it qualifies but: sbatch --wrap 'screen -D -m' followed by srun --jobid --pty screen -rd Or: sbatch -J screen --wrap 'screen -D
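
Spelled out, the pattern is a two-step one (a sketch; the job id is whatever sbatch prints):

    sbatch -J screen --wrap 'screen -D -m'   # batch job runs a detached screen
    srun --jobid=<jobid> --pty screen -rd    # attach to it from the login node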

Re: [slurm-users] slurm job_container/tmpfs

2023-11-21 Thread Ward Poelmans
Hi, On 21/11/2023 13:52, Arsene Marian Alain wrote: But how can the user write to or access the hidden directory .1809 if he doesn't have read/write permission on the main directory 1809? Because it works as a namespace. On my side: $ ls -alh /local/6000523/ total 0 drwx-- 3 root root 33
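
One way to see the namespace at work (job id from the example above; the bind-mount of the hidden directory onto the job's private /tmp is the plugin's documented behaviour):

    # as root on the node: the parent directory is root-only
    ls -alh /local/6000523/
    # from inside the job, .6000523 appears as a normal private /tmp
    srun --jobid=6000523 --pty findmnt /tmp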

Re: [slurm-users] slurm job_container/tmpfs

2023-11-21 Thread Ward Poelmans
Hi Arsene, On 21/11/2023 10:58, Arsene Marian Alain wrote: I just set Basepath=/scratch (a local directory on each node, already mounted with 1777 permissions) in job_container.conf. The plugin automatically creates a directory named after the JOB_ID for each job, for example:
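
A matching job_container.conf would look roughly like this (BasePath from the message; the second line is an optional assumption):

    # /etc/slurm/job_container.conf
    BasePath=/scratch
    AutoBasePath=true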

Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-11-10 Thread Ward Poelmans
Hi Ole, On 10/11/2023 15:04, Ole Holm Nielsen wrote: On 11/5/23 21:32, Ward Poelmans wrote: Yes, it's very similar. I've also put our systemd unit file online at https://gist.github.com/wpoely86/cf88e8e41ee885677082a7b08e12ae11 This might disturb the logic in waitforib.sh, or at least cause
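
A minimal sketch of the systemd wiring (unit names are assumptions; the real unit file is in the gist above): a drop-in makes slurmd wait for a oneshot service that runs the InfiniBand check.

    # /etc/systemd/system/slurmd.service.d/wait-for-ib.conf
    [Unit]
    Wants=waitforib.service
    After=waitforib.service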

Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-11-05 Thread Ward Poelmans
com/maxlxl/network.target_wait-for-interfaces ? Thanks, Ole On 11/1/23 20:09, Ward Poelmans wrote: We have a slightly different script to do the same. It only relies on /sys: # Search for infiniband devices and wait until # at least one reports that it is ACTIVE if [[ ! -d /sys/class/

Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-11-01 Thread Ward Poelmans
Hi, We have a slightly different script to do the same. It only relies on /sys: # Search for infiniband devices and wait until # at least one reports that it is ACTIVE if [[ ! -d /sys/class/infiniband ]] then logger "No infiniband found" exit 0 fi ports=$(ls
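
A hedged reconstruction of the whole script (only its first lines survive in the preview; the glob, loop, and timeout are assumptions):

    #!/bin/bash
    # Search for infiniband devices and wait until at least one port
    # reports that it is ACTIVE.
    if [[ ! -d /sys/class/infiniband ]]; then
        logger "No infiniband found"
        exit 0
    fi
    ports=$(ls /sys/class/infiniband/*/ports/*/state)
    for _ in $(seq 1 60); do
        grep -q ACTIVE $ports && exit 0
        sleep 2
    done
    logger "Timed out waiting for an ACTIVE infiniband port"
    exit 1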

Re: [slurm-users] Submitting jobs from machines outside the cluster

2023-08-28 Thread Ward Poelmans
Hi Steven, On 27/08/2023 08:17, Steven Swanson wrote: I'm trying to set up slurm as the backend for a system with Jupyter Notebook-based front end. The jupyter notebooks are running in containers managed by Jupyter Hub, which is a mostly turnkey system for providing docker containers that

Re: [slurm-users] Too many associations

2023-06-30 Thread Ward Poelmans
Hi Gary, On 29/06/2023 22:35, Jackson, Gary L. wrote: A follow-up: As the slurmctld code is written now, it seems that all job submission paths through Slurm get the `job_submit` plugin callback invoked on their behalf, which is great! However, if this is a promise that the API is making,

Re: [slurm-users] Submit sbatch to multiple partitions

2023-04-17 Thread Ward Poelmans
Hi Xaver, On 17/04/2023 11:36, Xaver Stiensmeier wrote: let's say I want to submit a large batch job that should run on 8 nodes. I have two partitions, each holding 4 nodes. Slurm will now tell me that "Requested node configuration is not available". However, my desired output would be that
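
For reference: a job may list several partitions, but Slurm places it entirely within one of them; it never stitches nodes together across partitions. So with two 4-node partitions, an 8-node job cannot run, while a 4-node job can use either:

    sbatch --partition=part1,part2 -N 4 job.sh   # runs in whichever partition can host it first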

Re: [slurm-users] Keep CPU Jobs Off GPU Nodes

2023-03-29 Thread Ward Poelmans
Hi, We have dedicated partitions for GPUs (their names end with _gpu) and simply forbid jobs that do not request GPU resources from using them: local function job_total_gpus(job_desc) -- return total number of GPUs allocated to the job -- there are many ways to request a
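
The Lua is cut off above; a simplified sketch of the idea (field names follow the job_submit.lua API; the real helper handles the many other ways of requesting GPUs):

    local function job_total_gpus(job_desc)
        -- only inspects tres_per_node, e.g. "gres:gpu:2"
        if job_desc.tres_per_node == nil then return 0 end
        return tonumber(string.match(job_desc.tres_per_node, "gpu:(%d+)") or 0)
    end

    function slurm_job_submit(job_desc, part_list, submit_uid)
        local part = job_desc.partition
        if part ~= nil and string.match(part, "_gpu$") and job_total_gpus(job_desc) == 0 then
            slurm.log_user("jobs in a GPU partition must request GPUs")
            return slurm.ERROR
        end
        return slurm.SUCCESS
    end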

Re: [slurm-users] speed / efficiency of sacct vs. scontrol

2023-02-27 Thread Ward Poelmans
On 24/02/2023 18:34, David Laehnemann wrote: Those queries should then not have to happen too often. Although, do you have any indication of a range for when you say "you still wouldn't want to query the status too frequently"? Because I don't really have one, and would probably opt for some compromise of

Re: [slurm-users] Request nodes with a custom resource?

2023-02-06 Thread Ward Poelmans
Hi Xaver, On 6/02/2023 08:39, Xaver Stiensmeier wrote: How would you schedule a job (let's say using srun) to work on these nodes? Of course this would be interesting in a dynamic case, too (assuming that the database is downloaded to nodes during job execution), but for now I would be happy
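
One common answer (an assumption here, since the reply is truncated) is a node Feature plus a job constraint; the feature name below is hypothetical:

    # slurm.conf: tag the nodes that hold the database copy
    NodeName=node[01-04] Features=localdb ...
    # job side:
    srun --constraint=localdb ...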

[slurm-users] GPU: MPS vs Sharding

2023-01-25 Thread Ward Poelmans
Hi, Slurm 22.05 has a new feature called GPU sharding that allows a single GPU to be used by multiple jobs at once. As far as I understood, the major difference from the MPS approach is that it is generic (not tied to NVidia technology). Has anyone tried it out? Does it work well? Any
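
For anyone trying it, the configuration shape is roughly as follows (counts are examples):

    # slurm.conf
    GresTypes=gpu,shard
    NodeName=gpunode01 Gres=gpu:4,shard:16
    # gres.conf on the node
    Name=shard Count=16
    # job: request a fraction of a GPU
    sbatch --gres=shard:1 job.sh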

Re: [slurm-users] srun jobfarming hassle question

2023-01-18 Thread Ward Poelmans
On 18/01/2023 15:22, Ohlerich, Martin wrote: But Magnus (thanks for the link!) is right. This is still far from a feature-rich job- or task-farming concept, where at least some overview of the passed/failed/missing task statistics is available, etc. GNU parallel has log output and
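
The log output referred to is presumably --joblog (a sketch; "myprog" is a placeholder):

    # per-task start time, runtime, and exit code; --resume reruns only
    # the failed or never-started tasks on a second invocation
    parallel --joblog tasks.log --resume -j $SLURM_NTASKS \
        srun -N 1 -n 1 -c 1 --exact myprog {} ::: *.input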

Re: [slurm-users] srun jobfarming hassle question

2023-01-18 Thread Ward Poelmans
Hi Martin, Just a tip: use gnu parallel instead of a for loop. Much easier and more powerful. Like: parallel -j $SLURM_NTASKS srun -N 1 -n 1 -c 1 --exact ::: *.input Ward
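
In a batch-script context the tip looks like this (resource numbers and "myprog" are placeholders):

    #!/bin/bash
    #SBATCH --nodes=2 --ntasks=8
    # one srun task per input file, at most $SLURM_NTASKS at a time
    parallel -j $SLURM_NTASKS srun -N 1 -n 1 -c 1 --exact myprog {} ::: *.input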

Re: [slurm-users] ABI Stability

2022-11-30 Thread Ward Poelmans
Hi Michael, On 30/11/2022 07:29, Michael Milton wrote: Considering this, my question is about which APIs (ABI, CLI, other?) are considered stable and worth targeting from a third party application. In addition, is there any initiative to making the ABI stable, because it seems like it would

Re: [slurm-users] Ideal NFS exported StateSaveLocation size.

2022-10-24 Thread Ward Poelmans
On 24/10/2022 09:32, Ole Holm Nielsen wrote: On 10/24/22 06:12, Richard Chang wrote: I have a two-node slurmctld setup and both will mount an NFS-exported directory as the state save location. It is definitely a BAD idea to store the Slurm StateSaveLocation on a slow NFS directory! SchedMD
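
For an HA pair the StateSaveLocation must be shared between both controllers, so the usual compromise (hedged; not a quote from the thread) is fast, low-latency shared storage rather than a generic NFS export:

    # slurm.conf, identical on both controllers (path is an example)
    StateSaveLocation=/slurm/statesave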

Re: [slurm-users] How to bind GPUs with CPU cores

2022-10-14 Thread Ward Poelmans
Hi William, On 14/10/2022 11:41, William Zhang wrote: How to realize this function? For example, a job requires 6 CPUs with 1 GPU, and it runs on GPU ID 0, CPU IDs 0-5. The second job requires 8 CPUs with 1 GPU; if it runs on GPU ID 1, we hope the CPU IDs are 16-23. The third job requires
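
The usual mechanism for this is the Cores= binding in gres.conf (device paths and core ranges below are assumptions matching the example in the question):

    # gres.conf: steer jobs that get a GPU onto the cores local to it
    NodeName=gpunode01 Name=gpu File=/dev/nvidia0 Cores=0-15
    NodeName=gpunode01 Name=gpu File=/dev/nvidia1 Cores=16-31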

Re: [slurm-users] Detecting non-MPI jobs running on multiple nodes

2022-09-29 Thread Ward Poelmans
Hi Loris, On 29/09/2022 09:26, Loris Bennett wrote: I can see that this is potentially not easy, since an MPI job might still have phases where only one core is actually being used. Slurm will create the needed cgroups on all the nodes that are part of the job when the job starts. So
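
One hedged way to spot suspects from outside the job (not from the truncated reply): compare per-step usage across the allocated nodes while the job runs.

    sstat -j <jobid> --format=JobID,NTasks,Nodelist,AveCPU,MaxRSS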

Re: [slurm-users] Multiple Program Runs using srun in one Slurm batch Job on one node

2022-06-15 Thread Ward Poelmans
Hi Guillaume, On 15/06/2022 16:59, Guillaume De Nayer wrote: Perhaps I misunderstand the Slurm documentation... I thought that the --exclusive option used in combination with sbatch would reserve the whole node (40 cores) for the job (submitted with sbatch). This part is working fine. I can
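
For completeness, the working pattern for several concurrent programs inside one exclusive job looks like this (numbers and program names are placeholders):

    #!/bin/bash
    #SBATCH --exclusive --nodes=1 --ntasks=4 --cpus-per-task=10
    # four 10-core steps run side by side on the 40-core node
    for i in 1 2 3 4; do
        srun -N 1 -n 1 -c 10 --exact ./myprog input$i &
    done
    wait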

Re: [slurm-users] How do you make --export=NONE the default behavior for our cluster?

2022-06-04 Thread Ward Poelmans
Hi, We're using a cli filter to do this. But it's trickier than just `--export=NONE`. For an srun inside an sbatch job, you want `--export=ALL` again because MPI will break otherwise. We have this in our cli filter: function slurm_cli_pre_submit(options, pack_offset) local
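
The filter is cut off above; a minimal sketch of the logic described (everything beyond the function signature is an assumption):

    function slurm_cli_pre_submit(options, pack_offset)
        -- inside an existing allocation (srun under sbatch) keep ALL so
        -- MPI keeps working; otherwise default to a clean environment
        if os.getenv("SLURM_JOB_ID") == nil and options["export"] == nil then
            options["export"] = "NONE"
        end
        return slurm.SUCCESS
    end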

Re: [slurm-users] SLURM: reconfig

2022-05-05 Thread Ward Poelmans
Hi Steven, I think truly dynamic adding and removing of nodes is something that's on the roadmap for slurm 23.02? Ward On 5/05/2022 15:28, Steven Varga wrote: Hi Tina, Thank you for sharing. This matches my observations when I checked whether slurm could do what I am up to: manage AWS EC2

Re: [slurm-users] Disable exclusive flag for users

2022-03-25 Thread Ward Poelmans
Hi PVD, On 25/03/2022 01:55, pankajd wrote: We have slurm 21.08.6 and GPUs in our compute nodes. We want to restrict / disable the use of the "exclusive" flag in srun for users. How should we do it? You can check for the flag in the job_submit.lua plugin and reject it if it's used, while also
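
A sketch of such a check (the encoding of --exclusive in job_desc.shared should be verified against your Slurm version):

    function slurm_job_submit(job_desc, part_list, submit_uid)
        if job_desc.shared == 0 then   -- 0 corresponds to --exclusive
            slurm.log_user("the --exclusive flag is not allowed here")
            return slurm.ERROR
        end
        return slurm.SUCCESS
    end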

Re: [slurm-users] Where is the documentation for saving batch script

2022-03-17 Thread Ward Poelmans
Hi Jeff, On 17/03/2022 15:39, Jeffrey R. Lang wrote: I want to look into the new feature of saving job scripts in the Slurm database but have been unable to find documentation on how to do it. Can someone please point me in the right direction for the documentation or slurm
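
For the record, the feature is switched on in slurm.conf and queried with sacct (available from Slurm 21.08 on):

    # slurm.conf
    AccountingStoreFlags=job_script
    # retrieve a stored script later
    sacct -j <jobid> --batch-script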

Re: [slurm-users] MPI Jobs OOM-killed which weren't pre-21.08.5

2022-02-10 Thread Ward Poelmans
Hi Paul, On 10/02/2022 14:33, Paul Brunk wrote: Now we see a problem in which the OOM killer is in some cases predictably killing job steps that don't seem to deserve it. In some cases these are job scripts and input files which ran fine before our Slurm upgrade. More details follow, but

Re: [slurm-users] problem with "configless" slurm.conf

2021-07-20 Thread Ward Poelmans
Hi, On 20/07/2021 16:01, Durai Arasan wrote: > This is limited to this one node only. Do you know how to fix this? I already > tried restarting the slurmd service on this node. Is the node properly defined in slurm.conf and does its DNS hostname resolve? scontrol show node slurm-bm-70
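
The two quick checks, as commands (the SRV record is what configless mode resolves; the domain is a placeholder):

    scontrol show node slurm-bm-70                 # does slurmctld know the node?
    dig +short -t SRV _slurmctld._tcp.example.com  # can the node find the ctld?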

Re: [slurm-users] MinJobAge

2021-07-06 Thread Ward Poelmans
On 6/07/2021 14:59, Emre Brookes wrote: > I'm using slurm 20.02.7 & have the same issue (except I am running batch > jobs). > Does MinJobAge work to keep completed jobs around for the specified duration > in squeue output? It does for me if I do 'squeue -t all'. This is slurm 20.11.7. Ward

Re: [slurm-users] How to avoid a feature?

2021-07-02 Thread Ward Poelmans
Hi Tina, On 2/07/2021 13:42, Tina Friedrich wrote: > We did think about having 'hidden' GPU partitions instead of wrangling it > with features, but there didn't seem to be any benefit to that that we could > see. The benefit of partitions is that you can set a bunch of options that are not
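
For illustration, a hidden partition can carry its own defaults and limits, which a bare node feature cannot (values are examples):

    PartitionName=gpu_hidden Nodes=gpunode[01-08] Hidden=YES MaxTime=3-00:00:00 DefMemPerCPU=4096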

Re: [slurm-users] Slurm stats in JSON format

2021-06-08 Thread Ward Poelmans
On 8/06/2021 00:27, Sid Young wrote: > Is there a tool that will extract the job counts in JSON format? Such as > #running, #pending, #on hold, etc. > > I am trying to build some custom dashboards for our new cluster and this > would be a really useful set of metrics to gather and display.
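
Without any REST/JSON plugins, the counts can be scraped and wrapped (a sketch):

    squeue -h -o '%T' | sort | uniq -c
    # or as JSON lines for a dashboard:
    squeue -h -o '%T' | sort | uniq -c | awk '{printf("{\"state\":\"%s\",\"count\":%s}\n",$2,$1)}'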

Re: [slurm-users] slurmrestd

2021-06-07 Thread Ward Poelmans
Hi, On 7/06/2021 04:33, David Schanzenbach wrote: > In our .rpmmacros file we use, the following option is set: > %_with_slurmrestd 1 You also need libjwt: https://bugs.schedmd.com/show_bug.cgi?id=4 Ward
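
The command-line equivalent of that .rpmmacros entry (JWT support additionally needs libjwt and its headers at build time, plus a slurm.spec that knows the flag; see the thread below):

    rpmbuild -ta slurm-*.tar.bz2 --with slurmrestd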

Re: [slurm-users] pam_slurm_adopt not working for all users

2021-05-27 Thread Ward Poelmans
On 27/05/2021 08:19, Loris Bennett wrote: > Thanks for the detailed explanations. I was obviously completely > confused about what MUNGE does. Would it be possible to say, in very > hand-waving terms, that MUNGE performs a similar role for the access of > processes to nodes as SSH does for the

Re: [slurm-users] Slurm reservation for migrating user home directories

2021-04-16 Thread Ward Poelmans
Hi Ole, On 16/04/2021 14:23, Ole Holm Nielsen wrote: > Question: Does anyone have experience with this type of scenario? Any > good ideas or suggestions for other methods of data migration? We once did something like that. Basically it worked like this: - The process is kicked off per
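
The process description is truncated; a hedged sketch of what one per-user step could look like (not Ward's exact commands):

    # stop new jobs, wait out the running ones, then migrate
    squeue -h -u "$user" -t PD -o %i | xargs -r -n1 scontrol hold
    while squeue -h -u "$user" -t R | grep -q .; do sleep 300; done
    rsync -a "/home/$user/" "/newhome/$user/"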

Re: [slurm-users] slurmrestd configuration

2021-04-12 Thread Ward Poelmans
Hi Simone, On 9/04/2021 18:03, Simone Riggi wrote: > All of them are working. > So in this case the only requirement for a user is having read/write > permission on the socket? Correct. Authentication works because with a socket you know which user is connecting. > My goal in the end would be to let a

Re: [slurm-users] slurmrestd configuration

2021-04-09 Thread Ward Poelmans
Hi Simone, On 8/04/2021 23:23, Simone Riggi wrote: > $ scontrol token lifespan=7200 username=riggi > > How can I configure and test the other auth method (local)? I am using > jwt at the moment. > I would like a user to be always authorized to use the rest API.  local means socket (so you don't
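
"local" auth in practice (paths and API version are examples): run slurmrestd on a unix socket and talk to it as yourself.

    slurmrestd unix:/run/slurmrestd.sock &
    curl --unix-socket /run/slurmrestd.sock 'http://localhost/slurm/v0.0.36/diag'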

Re: [slurm-users] slurmrestd configuration

2021-04-08 Thread Ward Poelmans
Hi Simone, On 8/04/2021 15:53, Simone Riggi wrote: > - I see that --with jwt is indeed not listed. I wonder how to build the > slurm auth plugins using rpmbuild? > In general I didn't understand from the docs which plugins slurmrestd > expects by default and where it searches for them. From -a

Re: [slurm-users] slurmrestd configuration

2021-04-08 Thread Ward Poelmans
Hi Ole, On 8/04/2021 10:09, Ole Holm Nielsen wrote: > On 4/8/21 9:50 AM, Simone Riggi wrote: >> >> rpmbuild -ta slurm-20.11.5.tar.bz2 --with mysql --with slurmrestd >> --with jwt > > I don't see this "--with jwt" in the slurm.spec file: It's not yet there:

Re: [slurm-users] slurmrestd configuration

2021-04-08 Thread Ward Poelmans
Hi Simone, On 8/04/2021 09:50, Simone Riggi wrote: > where /etc/slurm/slurmrestd.conf > > include /etc/slurm/slurm.conf > AuthType=auth/jwt Did you add a key? AuthAltParameters=jwt_key=/etc/slurm/jwt.key It needs to be present on the slurmdbd and slurmctld nodes. Ward
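
The full JWT setup, per the Slurm docs (key path as in the message):

    # generate the key once and copy it to the slurmctld and slurmdbd hosts
    dd if=/dev/random of=/etc/slurm/jwt.key bs=32 count=1
    chown slurm:slurm /etc/slurm/jwt.key && chmod 0600 /etc/slurm/jwt.key
    # slurm.conf and slurmdbd.conf
    AuthAltTypes=auth/jwt
    AuthAltParameters=jwt_key=/etc/slurm/jwt.key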

Re: [slurm-users] [External] Submitting to multiple paritions problem with gres specified

2021-03-09 Thread Ward Poelmans
Hi Prentice, On 8/03/2021 22:02, Prentice Bisbal wrote: > I have a very heterogeneous cluster with several different generations of > AMD and Intel processors; we use this method quite effectively. Could you elaborate a bit more on how you manage that? Do you force your users to pick a feature?

Re: [slurm-users] Get original script of a job

2021-03-05 Thread Ward Poelmans
Hi, On 5/03/2021 11:29, Alberto Morillas, Angelines wrote: > I know that when I send a job, with scontrol I can get the path and the > name of the script used to send this job, but normally the users change > their scripts and sometimes everything goes wrong after that, so is there any > possibility to

Re: [slurm-users] slurm/munge problem: invalid credentials

2020-12-16 Thread Ward Poelmans
On 15/12/2020 17:48, Olaf Gellert wrote: So munge seems to work as far as I can tell. What else does slurm use munge for? Are hostnames part of the authentication? Do I have to worry about the time "Thu Jan 01 01:00:00 1970"? I'm not an expert, but I know that hostnames are part of munge
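
The standard cross-node check from the MUNGE docs; mismatched keys or too much clock skew between the hosts will make it fail (hostname is a placeholder):

    munge -n | ssh othernode unmunge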