Re: [slurm-users] Why does Slurm kill one particular user's jobs after a few seconds?
Hi Ole,

Thanks for the suggestion. I am afraid the solution is not the same; at least, restarting `slurmdbd` and `slurmctld` on the head node has made no difference either. It puzzles me why Slurm appears to treat this one user differently from all others. Even other users under the same account are doing fine. I think the possible relation to another (array) job I was speculating about in my original message was just coincidental.

I have now tried the following three steps in the hope of somehow fixing the problem, none of which have changed the situation:

- Deleted the user from Slurm using `sacctmgr remove user` and re-created the user again afterwards.
- Removed the user's home directory and let the login procedure populate a new home directory from scratch for the user.
- Restarted `slurmdbd` and `slurmctld` as mentioned above.

Thomas
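PS: for anyone wanting to reproduce the first step, this is roughly what it looks like, as a dry-run sketch (the user and account names here are placeholders, not the real anonymized ones; the commands only print themselves, so drop the echo to run them against the accounting database):

```shell
# Dry-run sketch of removing and re-creating a user in the Slurm database.
# USER_NAME and ACCOUNT_NAME are placeholders (assumptions), not real names.
USER_NAME="someuser"
ACCOUNT_NAME="someaccount"
REMOVE_CMD="sacctmgr -i remove user name=$USER_NAME"
ADD_CMD="sacctmgr -i add user name=$USER_NAME account=$ACCOUNT_NAME"
# Print instead of executing; run the printed commands on a real cluster.
echo "$REMOVE_CMD"
echo "$ADD_CMD"
```

Note that removing a user this way clears their associations, which is exactly why it seemed like a plausible fix for per-user settings gone wrong.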
Re: [slurm-users] AutoDetect=nvml throwing an error message
Hi Michael,

Thanks. Indeed I don't have it; Slurm must not have detected it. I double-checked, and NVML is installed (libnvidia-ml-dev for Ubuntu). Here is some output, including the relevant paths for nvml. Is it possible to tell the slurm compilation to check these paths for nvml?

best

NVML PKG CHECK
➜ ~ sudo apt search nvml
Sorting... Done
Full Text Search... Done
cuda-nvml-dev-11-0/unknown 11.0.167-1 amd64
  NVML native dev links, headers
cuda-nvml-dev-11-1/unknown,unknown 11.1.74-1 amd64
  NVML native dev links, headers
cuda-nvml-dev-11-2/unknown,unknown 11.2.152-1 amd64
  NVML native dev links, headers
libnvidia-ml-dev/focal,now 10.1.243-3 amd64 [installed]
  NVIDIA Management Library (NVML) development files
python3-pynvml/focal 7.352.0-3 amd64
  Python3 bindings to the NVIDIA Management Library

NVML SHARED LIBRARY LOCATION
➜ ~ find /usr/lib | grep libnvidia-ml
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.450.102.04
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so

SLURM LIBS
➜ ~ ls /usr/lib64/slurm/
accounting_storage_mysql.so*       core_spec_none.so*             job_submit_pbs.so*                proctrack_pgid.so*
accounting_storage_none.so*        cred_munge.so*                 job_submit_require_timelimit.so*  route_default.so*
accounting_storage_slurmdbd.so*    cred_none.so*                  job_submit_throttle.so*           route_topology.so*
acct_gather_energy_ibmaem.so*      ext_sensors_none.so*           launch_slurm.so*                  sched_backfill.so*
acct_gather_energy_ipmi.so*        gpu_generic.so*                mcs_account.so*                   sched_builtin.so*
acct_gather_energy_none.so*        gres_gpu.so*                   mcs_group.so*                     sched_hold.so*
acct_gather_energy_pm_counters.so* gres_mic.so*                   mcs_none.so*                      select_cons_res.so*
acct_gather_energy_rapl.so*        gres_mps.so*                   mcs_user.so*                      select_cons_tres.so*
acct_gather_energy_xcc.so*         gres_nic.so*                   mpi_none.so*                      select_linear.so*
acct_gather_filesystem_lustre.so*  jobacct_gather_cgroup.so*      mpi_pmi2.so*                      site_factor_none.so*
acct_gather_filesystem_none.so*    jobacct_gather_linux.so*       mpi_pmix.so@                      slurmctld_nonstop.so*
acct_gather_interconnect_none.so*  jobacct_gather_none.so*        mpi_pmix_v2.so*                   src/
acct_gather_interconnect_ofed.so*  jobcomp_elasticsearch.so*      node_features_knl_generic.so*     switch_none.so*
acct_gather_profile_hdf5.so*       jobcomp_filetxt.so*            power_none.so*                    task_affinity.so*
acct_gather_profile_influxdb.so*   jobcomp_lua.so*                preempt_none.so*                  task_cgroup.so*
acct_gather_profile_none.so*       jobcomp_mysql.so*              preempt_partition_prio.so*        task_none.so*
auth_munge.so*                     jobcomp_none.so*               preempt_qos.so*                   topology_3d_torus.so*
burst_buffer_generic.so*           jobcomp_script.so*             prep_script.so*                   topology_hypercube.so*
cli_filter_lua.so*                 job_container_cncu.so*         priority_basic.so*                topology_none.so*
cli_filter_none.so*                job_container_none.so*         priority_multifactor.so*          topology_tree.so*
cli_filter_syslog.so*              job_submit_all_partitions.so*  proctrack_cgroup.so*
cli_filter_user_defaults.so*       job_submit_lua.so*             proctrack_linuxproc.so*

On Thu, Apr 15, 2021 at 9:02 AM Michael Di Domenico wrote:
> the error message sounds like when you built the slurm source it
> wasn't able to find the nvml devel packages. if you look in where you
> installed slurm, in lib/slurm you should have a gpu_nvml.so. do you?
>
> On Wed, Apr 14, 2021 at 5:53 PM Cristóbal Navarro wrote:
> >
> > typing error, should be --> **located at /usr/include/nvml.h**
> >
> > On Wed, Apr 14, 2021 at 5:47 PM Cristóbal Navarro <cristobal.navarr...@gmail.com> wrote:
> >>
> >> Hi community,
> >> I have set up the configuration files as mentioned in the documentation,
> >> but the slurmd of the GPU-compute node fails with the following error
> >> shown in the log.
> >> After reading the slurm documentation, it is not entirely clear to me
> >> how to properly set up GPU autodetection for the gres.conf file as it
> >> does not mention if the nvml detection should be automatic or not.
> >> I have also read the top google searches including
> >> https://lists.schedmd.com/pipermail/slurm-users/2020-February/004832.html
> >> but that was a problem of a cuda installation being overwritten (not my case).
> >> This is a DGX A100 node that comes with the Nvidia driver installed, and
> >> nvml is located at /etc/include/nvml.h; not sure if there is a libnvml.so
> >> or similar as well.
> >> How to tell SLURM to look at those paths? Any ideas or experience sharing
> >> are welcome.
> >> best
> >>
> >>
> >> slurmd.log (GPU node)
> >> [2021-04-14T17:31:42.302] got shutdown request
> >> [2021-04-1
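Coming back to the question above about pointing the build at those paths: Slurm's configure script accepts a --with-nvml=PATH option naming the NVML installation prefix. Below is a dry-run sketch (the /usr prefix is my assumption based on the header and library locations shown earlier; check config.log after configuring to confirm NVML was actually picked up):

```shell
# Dry-run sketch: rebuild Slurm with configure pointed at the NVML prefix.
# /usr is assumed here because nvml.h lives in /usr/include and
# libnvidia-ml.so in /usr/lib/x86_64-linux-gnu; adjust for your layout.
CONFIGURE_CMD="./configure --prefix=/usr --sysconfdir=/etc/slurm --with-nvml=/usr"
# Print instead of executing; run the printed command from the Slurm source tree.
echo "$CONFIGURE_CMD"
```

After a successful NVML-enabled build, a gpu_nvml.so plugin should appear next to the gpu_generic.so listed above.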
[slurm-users] NHC and slurm
Hello,

I'm trying to set up NHC [0] for our Slurm cluster, but I'm not getting it to work properly. I'm using the dev branch from [0] and compiled it this way:

$ ./autogen.sh --prefix=/usr --sysconfdir=/etc --libexecdir=/usr/lib
$ make test
$ sudo make install

When I run nhc, I get an error that sshd is not running:

$ sudo nhc
ERROR: nhc: Health check failed: check_ps_service: Service sshd (process sshd) owned by root not running

I know sshd is running, because I logged in to this machine with ssh, and `systemctl status sshd` shows it is active. Here's a sample of my nhc.conf:

* || check_ps_service munged
* || check_ps_service -u root sshd
* || check_ps_service -u root ssh
* || check_ps_service ssh
* || check_ps_service sshd

If I run `sudo nhc -a` to run all the tests, it gives 4 errors about ssh. NHC can find munge running, so what's the problem with ssh? What am I missing? I'm using Ubuntu 20.04.

Cheers,
Heitor

[0] https://github.com/mej/nhc/
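PS: one way I tried to narrow this down, outside NHC, is to check whether any process actually matches what that nhc.conf line asks for: a command named sshd owned by root. A small sketch using plain procps (nothing here is NHC-specific, so treat it only as an approximation of what check_ps_service matches on):

```shell
# Sketch: list processes whose owner and command name match what the
# nhc.conf line `* || check_ps_service -u root sshd` is looking for.
# If this prints nothing, NHC will presumably not find the process either.
match_proc() {
    # $1 = owner, $2 = command name
    ps -e -o user= -o comm= | awk -v u="$1" -v c="$2" '$1 == u && $2 == c'
}
match_proc root sshd || true
```

If the command name printed by ps differs (e.g. truncated or prefixed), that mismatch would explain why NHC reports the service as not running.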
Re: [slurm-users] AutoDetect=nvml throwing an error message
the error message sounds like when you built the slurm source it wasn't able to find the nvml devel packages. if you look in where you installed slurm, in lib/slurm you should have a gpu_nvml.so. do you?

On Wed, Apr 14, 2021 at 5:53 PM Cristóbal Navarro wrote:
>
> typing error, should be --> **located at /usr/include/nvml.h**
>
> On Wed, Apr 14, 2021 at 5:47 PM Cristóbal Navarro wrote:
>>
>> Hi community,
>> I have set up the configuration files as mentioned in the documentation,
>> but the slurmd of the GPU-compute node fails with the following error
>> shown in the log.
>> After reading the slurm documentation, it is not entirely clear to me how
>> to properly set up GPU autodetection for the gres.conf file, as it does
>> not mention if the nvml detection should be automatic or not.
>> I have also read the top google searches including
>> https://lists.schedmd.com/pipermail/slurm-users/2020-February/004832.html
>> but that was a problem of a cuda installation being overwritten (not my case).
>> This is a DGX A100 node that comes with the Nvidia driver installed, and
>> nvml is located at /etc/include/nvml.h; not sure if there is a libnvml.so
>> or similar as well.
>> How to tell SLURM to look at those paths? Any ideas or experience sharing
>> are welcome.
>> best
>>
>>
>> slurmd.log (GPU node)
>> [2021-04-14T17:31:42.302] got shutdown request
>> [2021-04-14T17:31:42.302] all threads complete
>> [2021-04-14T17:31:42.303] debug2: _file_read_uint32s: unable to open '(null)/tasks' for reading : No such file or directory
>> [2021-04-14T17:31:42.303] debug2: xcgroup_get_pids: unable to get pids of '(null)'
>> [2021-04-14T17:31:42.303] debug2: _file_read_uint32s: unable to open '(null)/tasks' for reading : No such file or directory
>> [2021-04-14T17:31:42.303] debug2: xcgroup_get_pids: unable to get pids of '(null)'
>> [2021-04-14T17:31:42.304] debug: gres/gpu: fini: unloading
>> [2021-04-14T17:31:42.304] debug: gpu/generic: fini: fini: unloading GPU Generic plugin
>> [2021-04-14T17:31:42.304] select/cons_tres: common_fini: select/cons_tres shutting down ...
>> [2021-04-14T17:31:42.304] debug2: spank: spank_pyxis.so: slurmd_exit = 0
>> [2021-04-14T17:31:42.304] cred/munge: fini: Munge credential signature plugin unloaded
>> [2021-04-14T17:31:42.304] Slurmd shutdown completing
>> [2021-04-14T17:31:42.321] debug: Log file re-opened
>> [2021-04-14T17:31:42.321] debug2: hwloc_topology_init
>> [2021-04-14T17:31:42.321] debug2: hwloc_topology_load
>> [2021-04-14T17:31:42.440] debug2: hwloc_topology_export_xml
>> [2021-04-14T17:31:42.446] Considering each NUMA node as a socket
>> [2021-04-14T17:31:42.446] debug: CPUs:256 Boards:1 Sockets:8 CoresPerSocket:16 ThreadsPerCore:2
>> [2021-04-14T17:31:42.446] debug: Reading cgroup.conf file /etc/slurm/cgroup.conf
>> [2021-04-14T17:31:42.447] debug2: hwloc_topology_init
>> [2021-04-14T17:31:42.447] debug2: xcpuinfo_hwloc_topo_load: xml file (/var/spool/slurmd/slurmd/hwloc_topo_whole.xml) found
>> [2021-04-14T17:31:42.448] Considering each NUMA node as a socket
>> [2021-04-14T17:31:42.448] debug: CPUs:256 Boards:1 Sockets:8 CoresPerSocket:16 ThreadsPerCore:2
>> [2021-04-14T17:31:42.449] GRES: Global AutoDetect=nvml(1)
>> [2021-04-14T17:31:42.449] debug: gres/gpu: init: loaded
>> [2021-04-14T17:31:42.449] fatal: We were configured to autodetect nvml functionality, but we weren't able to find that lib when Slurm was configured.
>>
>>
>> gres.conf (just AutoDetect=nvml)
>> ➜ ~ cat /etc/slurm/gres.conf
>> # GRES configuration for native GPUS
>> # DGX A100 8x Nvidia A100
>> # not working, slurm cannot find nvml
>> AutoDetect=nvml
>> #Name=gpu File=/dev/nvidia[0-7]
>> #Name=gpu Type=A100 File=/dev/nvidia[0-7]
>> #Name=gpu Type=A100 File=/dev/nvidia0 Cores=0-7
>> #Name=gpu Type=A100 File=/dev/nvidia1 Cores=8-15
>> #Name=gpu Type=A100 File=/dev/nvidia2 Cores=16-23
>> #Name=gpu Type=A100 File=/dev/nvidia3 Cores=24-31
>> #Name=gpu Type=A100 File=/dev/nvidia4 Cores=32-39
>> #Name=gpu Type=A100 File=/dev/nvidia5 Cores=40-47
>> #Name=gpu Type=A100 File=/dev/nvidia6 Cores=48-55
>> #Name=gpu Type=A100 File=/dev/nvidia7 Cores=56-63
>>
>>
>> slurm.conf
>> GresTypes=gpu
>> AccountingStorageTRES=gres/gpu
>> DebugFlags=CPU_Bind,gres
>>
>> ## We don't want a node to go back in pool without sys admin acknowledgement
>> ReturnToService=0
>>
>> ## Basic scheduling
>> #SelectType=select/cons_res
>> SelectType=select/cons_tres
>> SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE
>> SchedulerType=sched/backfill
>>
>> TaskPlugin=task/cgroup
>> ProctrackType=proctrack/cgroup
>>
>> ## Nodes list
>> ## use native GPUs
>> NodeName=nodeGPU01 SocketsPerBoard=8 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=1024000 State=UNKNOWN Gres=gpu:8 Feature=ht,gpu
>>
>> ## Partitions list
>> PartitionName=gpu OverSubscribe=FORCE DefCpuPerGPU=8 MaxTime=INFINITE State=UP Nodes=nodeGPU01 Default=YES
>> PartitionName=cpu OverSubscribe=FORCE Ma
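to check the gpu_nvml.so question concretely on the node, something like the sketch below works (the plugin directory is an assumption matching a common packaged layout; adjust it to wherever your Slurm install puts its plugins):

```shell
# Sketch: check whether this Slurm build includes the NVML GPU plugin.
# PLUGINDIR is an assumption (a common default layout); override as needed.
PLUGINDIR="${PLUGINDIR:-/usr/lib64/slurm}"
if [ -e "$PLUGINDIR/gpu_nvml.so" ]; then
    STATUS="gpu_nvml.so present"
else
    STATUS="gpu_nvml.so missing - Slurm was built without NVML support"
fi
echo "$STATUS"
```

If it is missing, the fatal error above is expected regardless of how gres.conf is written, and the fix is at build time rather than in configuration.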
Re: [slurm-users] GRES Restrictions
Hello,

is there a best practice for activating this feature (setting ConstrainDevices=yes)? Do I have to restart the slurmds? Does this affect running jobs? We are using Slurm 19.05.

Best,
Stefan

On Tuesday, 25 August 2020 at 17:24:41 CEST, Christoph Brüning wrote:
> Hello,
>
> we're using cgroups to restrict access to the GPUs.
>
> What I found particularly helpful are the slides by Marshall Garey from
> last year's Slurm User Group Meeting:
> https://slurm.schedmd.com/SLUG19/cgroups_and_pam_slurm_adopt.pdf
> (NVML didn't work for us for some reason I cannot recall, but listing
> the GPU device files explicitly was not a big deal)
>
> Best,
> Christoph
>
> On 25/08/2020 16.12, Willy Markuske wrote:
> > Hello,
> >
> > I'm trying to restrict access to GPU resources on a cluster I maintain
> > for a research group. There are two nodes put into a partition with gres
> > gpu resources defined. Users can access these resources by submitting
> > their job under the gpu partition and defining a gres=gpu.
> >
> > When a user includes the flag --gres=gpu:# they are allocated the number
> > of gpus and slurm properly allocates them. If a user requests only 1 gpu
> > they only see CUDA_VISIBLE_DEVICES=1. However, if a user does not
> > include the --gres=gpu:# flag they can still submit a job to the
> > partition and are then able to see all the GPUs. This has led to some
> > bad actors running jobs on all GPUs that other users have allocated and
> > causing OOM errors on the gpus.
> >
> > Is it possible, and where would I find the documentation on doing so, to
> > require users to define a --gres=gpu:# to be able to submit to a
> > partition? So far, reading the gres documentation doesn't seem to have
> > yielded any word on this issue specifically.
> >
> > Regards,

--
Stefan Stäglich, Universität Freiburg, Institut für Informatik
Georges-Köhler-Allee, Geb. 52, 79110 Freiburg, Germany

E-Mail:  staeg...@informatik.uni-freiburg.de
WWW:     gki.informatik.uni-freiburg.de
Telefon: +49 761 203-54216
Fax:     +49 761 203-8222
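PS: for reference, a minimal cgroup.conf along these lines is what the device constraint usually amounts to (a sketch only, not a drop-in file; the settings other than ConstrainDevices are illustrative assumptions, and as far as I know cgroup.conf changes take effect after restarting slurmd on the compute nodes and only apply to newly launched job steps):

```
# /etc/slurm/cgroup.conf -- minimal sketch (illustrative, not authoritative)
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes   # jobs only see the GPU device files they were allocated
```

Testing on a drained node first seems prudent before rolling it out cluster-wide.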
Re: [slurm-users] Why does Slurm kill one particular user's jobs after a few seconds?
Hi Thomas,

I wonder if your problem is related to the one reported in this list thread?
https://lists.schedmd.com/pipermail/slurm-users/2021-April/007107.html

You could try to restart the slurmctld service, and also make sure your configuration (slurm.conf etc.) has been pushed correctly to the slurmd nodes.

/Ole

On 4/14/21 9:53 AM, Thomas Arildsen wrote:

Oh, and I forgot to mention that we are using Slurm version 20.11.3.
Best,
Thomas

On Wed, 14 Apr 2021 at 09:23 +0200, Thomas Arildsen wrote:

I administer a Slurm cluster with many users, and the operation of the cluster currently appears "totally normal" for all users; except for one. This one user gets all attempts to run commands through Slurm killed after 20-25 seconds (I think the cause is another job - not so much the time; see further down). The following minimal example reproduces the error:

$ sudo -u  srun --pty sleep 25
srun: job 110962 queued and waiting for resources
srun: job 110962 has been allocated resources
srun: Force Terminated job 110962
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 110962.0 ON  CANCELLED AT 2021-04-09T16:33:35 ***
srun: error: : task 0: Terminated

When this happens, I find this line in the slurmctld log:

_slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=110962 uid

It only happens for '' and not for any other user that I know of. This very similar but shorter-running example works fine:

$ sudo -u  srun --pty sleep 20
srun: job 110963 queued and waiting for resources
srun: job 110963 has been allocated resources

Note that when I run `srun --pty sleep 20` as myself, srun does not output the two "srun: job ..." lines. This seems to me to be an additional indication that srun is subject to some different settings for ''. All settings that I have been able to inspect appear identical for '' and for other users. I have checked, and 'MaxWall' is not set for this user, nor for any other user.
Other users belonging to the same Slurm account do not experience this problem. When this unfortunate user's jobs get allocated, I see messages like this in '/var/log/slurm/slurmctld.log':

sched: _slurm_rpc_allocate_resources JobId=111855 NodeList=

and shortly after, I see this message:

select/cons_tres: common_job_test: no job_resources info for JobId=110722_* rc=0

Job 110722_* is a pending array job by another user that is pending due to 'QOSMaxGRESPerUser'. One pending part of this array job (110722_57) eventually ends up taking over job 111855's CPU cores when 111855 gets killed. This leads me to believe that 110722_57 causes 111855 to be killed. However, 110722_57 remains pending afterwards. Some of the things I fail to understand here are:

- Why does a pending job kill another job, yet remain pending afterwards?
- Why does the pending job even have privileges to kill another job in the first place?
- Why does this only affect ''s jobs but not those of other users?

None of this is intended to happen. I am guessing it must be caused by some settings specific to '', but I cannot figure out what they are, and they are not supposed to be like this. If these are settings we admins somehow caused, it was unintended.

NB: some details have been anonymized as above. I hope someone has a clue what is going on here.

Thanks in advance,
Thomas Arildsen
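PS: for completeness, these are the kinds of queries I have been using to compare the affected user's limits against a working user's, as a dry-run sketch (the user name is a placeholder, and the format field lists are what I believe sacctmgr accepts; the commands only print themselves here):

```shell
# Dry-run sketch: compare a user's associations and the QOS limits that
# could explain a kill. "someuser" is a placeholder for the anonymized name.
ASSOC_CMD="sacctmgr show assoc user=someuser format=Cluster,Account,User,Partition,MaxJobs,MaxWall,QOS"
QOS_CMD="sacctmgr show qos format=Name,MaxWall,MaxTRESPU"
# Print instead of executing; run the printed commands on the cluster.
echo "$ASSOC_CMD"
echo "$QOS_CMD"
```

Comparing that output side by side for the affected user and an unaffected user in the same account is how I concluded that MaxWall is not set for anyone.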